Weekly TMLR digest for May 05, 2024

TMLR

May 4, 2024, 8:00:18 PM
to tmlr-annou...@googlegroups.com


New certifications
==================

Survey Certification: A Survey of Temporal Credit Assignment in Deep Reinforcement Learning

Eduardo Pignatelli, Johan Ferret, Matthieu Geist, Thomas Mesnard, Hado van Hasselt, Laura Toni

https://openreview.net/forum?id=bNtr6SLgZf

---


Expert Certification: Differential Equation Scaling Limits of Shaped and Unshaped Neural Networks

Mufan Bill Li, Mihai Nica

https://openreview.net/forum?id=iRDwUXYsSJ

---


Expert Certification: Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models

Avi Singh, John D Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Xavier Garcia, Peter J Liu, James Harrison, Jaehoon Lee, Kelvin Xu, Aaron T Parisi, Abhishek Kumar, Alexander A Alemi, Alex Rizkowsky, Azade Nova, Ben Adlam, Bernd Bohnet, Gamaleldin Fathy Elsayed, Hanie Sedghi, Igor Mordatch, Isabelle Simpson, Izzeddin Gur, Jasper Snoek, Jeffrey Pennington, Jiri Hron, Kathleen Kenealy, Kevin Swersky, Kshiteej Mahajan, Laura A Culp, Lechao Xiao, Maxwell Bileschi, Noah Constant, Roman Novak, Rosanne Liu, Tris Warkentin, Yamini Bansal, Ethan Dyer, Behnam Neyshabur, Jascha Sohl-Dickstein, Noah Fiedel

https://openreview.net/forum?id=lNAyUngGFK

---


Reproducibility Certification: Reproducibility Study of "Robust Fair Clustering: A Novel Fairness Attack and Defense Framework"

Iason Skylitsis, Zheng Feng, Idries Nasim, Camille Niessink

https://openreview.net/forum?id=H1hLNjwrGy

---


Survey Certification: A Unified View of Differentially Private Deep Generative Modeling

Dingfan Chen, Raouf Kerkouche, Mario Fritz

https://openreview.net/forum?id=YgmBD2c9qX

---


Featured Certification: ModuLoRA: Finetuning 2-Bit LLMs on Consumer GPUs by Integrating with Modular Quantizers

Junjie Yin, Jiahao Dong, Yingheng Wang, Christopher De Sa, Volodymyr Kuleshov

https://openreview.net/forum?id=r9p9CV52MV

---


Accepted papers
===============


Title: Hybrid Federated Learning for Feature & Sample Heterogeneity: Algorithms and Implementation

Authors: Xinwei Zhang, Wotao Yin, Mingyi Hong, Tianyi Chen

Abstract: Federated learning (FL) is a popular distributed machine learning paradigm dealing with distributed and private data sets. Based on the data partition pattern, FL is often categorized into horizontal, vertical, and hybrid settings. All three settings have many applications, but hybrid FL remains relatively underexplored, because it deals with the challenging situation where {\it both} the feature space and the data samples are {\it heterogeneous}.
This work designs a novel mathematical model that effectively allows the clients to aggregate distributed data with heterogeneous, and possibly overlapping features and samples. Our main idea is to partition each client's model into a feature extractor part and a classifier part, where the former can be used to process the input data, while the latter is used to perform the learning from the extracted features. The heterogeneous feature aggregation is done through building a server model, which assimilates local classifiers and feature extractors through a carefully designed matching mechanism. A communication-efficient algorithm is then designed to train both the client and server models. Finally, we conducted numerical experiments on multiple image classification data sets to validate the performance of the proposed algorithm. To our knowledge, this is the first formulation and algorithm developed for hybrid FL.
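
As a rough illustration of the model partition described above (not the authors' implementation: the matching mechanism and training loop are omitted, and all shapes are made up), each client keeps a private feature extractor over its own feature space while only the classifier part is aggregated by the server:

import numpy as np

# Hypothetical toy setup: each client holds a feature extractor over its own
# (heterogeneous) input space plus a classifier over a shared latent space.
rng = np.random.default_rng(0)
d_latent, n_classes = 16, 10
clients = []
for d_in in (20, 35, 28):                               # heterogeneous feature dims
    clients.append({
        "extractor": rng.normal(size=(d_in, d_latent)),  # stays client-specific
        "classifier": rng.normal(size=(d_latent, n_classes)),
    })

def server_aggregate(clients):
    """Average only the classifier parts; extractors remain local."""
    avg = np.mean([c["classifier"] for c in clients], axis=0)
    for c in clients:
        c["classifier"] = avg.copy()
    return avg

server_classifier = server_aggregate(clients)
print(server_classifier.shape)   # (16, 10)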

URL: https://openreview.net/forum?id=qc2lmWkvk4

---

Title: Fixed Budget Best Arm Identification in Unimodal Bandits

Authors: Debamita Ghosh, Manjesh Kumar Hanawal, Nikola Zlatanov

Abstract: We consider the best arm identification problem in a fixed budget stochastic multi-armed bandit in which arm means exhibit unimodal structure, i.e., there is only one local maximum. We establish that the probability of misidentifying the optimal arm within a budget of $T$ is lower bounded as $\mathcal{O}\left(\exp\left\{-T/\bar{H}\right\}\right)$, where $\bar{H}$ depends on the sub-optimality gaps of arms in the neighborhood of the optimal arm. In contrast to the lower bound for the unstructured case, the error exponent in this bound does not depend on the number of arms $K$ and is smaller by a factor $\log K$, which captures the gain achievable by exploiting the unimodal structure. We then develop an algorithm named {\it Fixed Budget Best Arm Unimodal Bandits (FB-BAUB)} that exploits unimodality to achieve the gain. Specifically, we show that the error probability of FB-BAUB is upper bounded as $\mathcal{O}\left(\log_2 K\exp\left\{-T\Delta^2\right\}\right)$, where $\Delta$ is the gap between the neighboring arms and $\bar{H}\leq 2\Delta^{-2}$. We demonstrate that FB-BAUB outperforms the state-of-the-art algorithms through extensive simulations. Moreover, FB-BAUB is parameter-free and simple to implement.

URL: https://openreview.net/forum?id=epcLNhkoEL

---

Title: Restricted Random Pruning at Initialization for High Compression Range

Authors: Hikari Otsuka, Yasuyuki Okoshi, Ángel López García-Arias, Kazushi Kawamura, Thiem Van Chu, Daichi Fujiki, Masato Motomura

Abstract: Pruning at Initialization (PaI) makes training overparameterized neural networks more efficient by reducing the overall computational cost from training to inference. Recent PaI studies showed that random pruning is more effective than ranking-based pruning, which learns connectivity. However, the effectiveness of each pruning method depends on the existence of skip connections and the compression ratio (the ratio of parameter counts before and after pruning). While random pruning performs better than ranking-based pruning on architectures with skip connections, this superiority is reversed on architectures without skip connections in the high compression range. This paper proposes Minimum Connection Assurance (MiCA) that achieves higher accuracy than conventional PaI methods for architectures with and without skip connections, regardless of the compression ratio. MiCA preserves the random connection between the layers and maintains the performance at high compression ratios without the costly connection learning that ranking-based pruning requires. Experiments on image classification using CIFAR-10 and CIFAR-100 and node classification using OGBN-ArXiv show that MiCA enhances the compression ratio and accuracy trade-offs compared to existing PaI methods. In VGG-16 with CIFAR-10, MiCA improves the accuracy of random pruning by $27.0\%$ at $10^{4.7}\times$ compression ratio. Furthermore, experimental analysis reveals that increasing the utilization of the nodes through which information flows from the first layer is essential for maintaining high performance at a high compression ratio.

URL: https://openreview.net/forum?id=yf4ciZcgrg

---

Title: Continual HyperTransformer: A Meta-Learner for Continual Few-Shot Learning

Authors: Max Vladymyrov, Andrey Zhmoginov, Mark Sandler

Abstract: We focus on the problem of learning without forgetting from multiple tasks arriving sequentially, where each task is defined using a few-shot episode of novel or already seen classes. We approach this problem using the recently published HyperTransformer (HT), a Transformer-based hypernetwork that generates specialized task-specific CNN weights directly from the support set. In order to learn from a continual sequence of tasks, we propose to recursively re-use the generated weights as input to the HT for the next task. This way, the generated CNN weights themselves act as a representation of previously learned tasks, and the HT is trained to update these weights so that the new task can be learned without forgetting past tasks. This approach is different from most continual learning algorithms that typically rely on using replay buffers, weight regularization or task-dependent architectural changes. We demonstrate that our proposed Continual HyperTransformer method equipped with a prototypical loss is capable of learning and retaining knowledge about past tasks for a variety of scenarios, including learning from mini-batches, and task-incremental and class-incremental learning scenarios.

URL: https://openreview.net/forum?id=zdtSqZnkx1

---

Title: Federated Learning with Convex Global and Local Constraints

Authors: Chuan He, Le Peng, Ju Sun

Abstract: In practice, many machine learning (ML) problems come with constraints, and their applied domains involve distributed sensitive data that cannot be shared with others, e.g., in healthcare. Collaborative learning in such practical scenarios entails federated learning (FL) for ML problems with constraints, or FL with constraints for short. Despite the extensive developments of FL techniques in recent years, these techniques only deal with unconstrained FL problems or FL problems with simple constraints that are amenable to easy projections. There is little work dealing with FL problems with general constraints. To fill this gap, we take the first step toward building an algorithmic framework for solving FL problems with general constraints. In particular, we propose a new FL algorithm for constrained ML problems based on the proximal augmented Lagrangian (AL) method. Assuming convex objective and convex constraints plus other mild conditions, we establish the worst-case complexity of the proposed algorithm. Our numerical experiments show the effectiveness of our algorithm in performing Neyman-Pearson classification and fairness-aware learning with nonconvex constraints, in an FL setting.
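
The proximal augmented Lagrangian step named above can be illustrated on a toy, single-machine convex problem; this sketch uses made-up problem data and plain gradient descent for the inner solve, and says nothing about the federated splitting in the paper:

import numpy as np

# Toy convex problem: minimize ||x - b||^2  subject to  a^T x <= 0.
b = np.array([1.0, 2.0])
a = np.array([1.0, 1.0])

def constraint(x):              # feasible region is c(x) <= 0
    return a @ x

x, lam, rho, beta = np.zeros(2), 0.0, 5.0, 1.0   # beta: proximal weight

for _ in range(200):
    x_prev = x.copy()
    # Inner solve: gradient descent on the proximal augmented Lagrangian
    for _ in range(100):
        viol = max(0.0, lam / rho + constraint(x))
        grad = 2 * (x - b) + rho * viol * a + beta * (x - x_prev)
        x = x - 0.05 * grad
    # Dual (multiplier) update
    lam = max(0.0, lam + rho * constraint(x))

print(x, constraint(x))   # x is roughly the projection of b onto {a^T x <= 0}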

URL: https://openreview.net/forum?id=qItxVbWyfe

---

Title: Group Fairness in Reinforcement Learning via Multi-Objective Rewards

Authors: Jack Blandin, Ian A. Kash

Abstract: Recent works extend classification group fairness measures to sequential decision processes such as reinforcement learning (RL) by measuring fairness as the difference in decision-maker utility (e.g. accuracy) of each group. This approach suffers when decision-maker utility is not perfectly aligned with group utility, such as in repeat loan applications where a false positive (loan default) impacts the groups (applicants) and decision-maker (lender) by different magnitudes. Some works remedy this by measuring fairness in terms of group utility, typically referred to as their "qualification", but few works offer solutions that yield group qualification equality. Those that do are prone to violating the "no-harm" principle where one or more groups' qualifications are lowered in order to achieve equality. In this work, we characterize this problem space as having three implicit objectives: maximizing decision-maker utility, maximizing group qualification, and minimizing the difference in qualification between groups. We provide a RL policy learning technique that optimizes for these objectives directly by constructing a multi-objective reward function that encodes these objectives as distinct reward signals. Under suitable parameterizations our approach is guaranteed to respect the "no-harm" principle.

URL: https://openreview.net/forum?id=cueEUSG7lE

---

Title: On Good Practices for Task-Specific Distillation of Large Pretrained Visual Models

Authors: Juliette Marrie, Michael Arbel, Julien Mairal, Diane Larlus

Abstract: Large pretrained visual models exhibit remarkable generalization across diverse recognition tasks. Yet, real-world applications often demand compact models tailored to specific problems. Variants of knowledge distillation have been devised for such a purpose, enabling task-specific compact models (the students) to learn from a generic large pretrained one (the teacher). In this paper, we show that the excellent robustness and versatility of recent pretrained models challenge common practices established in the literature, calling for a new set of optimal guidelines for task-specific distillation. To address the lack of samples in downstream tasks, we also show that a variant of Mixup based on stable diffusion complements standard data augmentation. This strategy eliminates the need for engineered text prompts and improves distillation of generic models into streamlined specialized networks.

URL: https://openreview.net/forum?id=oyISaaeHwD

---

Title: Dynamic Online Ensembles of Basis Expansions

Authors: Daniel Waxman, Petar Djuric

Abstract: Practical Bayesian learning often requires (1) online inference, (2) dynamic models, and (3) ensembling over multiple different models. Recent advances have shown how to use random feature approximations to achieve scalable, online ensembling of Gaussian processes with desirable theoretical properties and fruitful applications. One key to these methods' success is the inclusion of a random walk on the model parameters, which makes models dynamic. We show that these methods can be generalized easily to any basis expansion model and that using alternative basis expansions, such as Hilbert space Gaussian processes, often results in better performance. To simplify the process of choosing a specific basis expansion, our method's generality also allows the ensembling of several entirely different models, for example, a Gaussian process and polynomial regression. Finally, we propose a novel method to ensemble static and dynamic models together.
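
A compact sketch of the kind of model family discussed above: online Bayesian linear regression over a random-feature basis expansion, with a random walk on the weights making the model dynamic (Kalman-style predict/update). The ensembling over several basis expansions and the Hilbert space GP features are not shown, and all hyperparameters are illustrative:

import numpy as np

rng = np.random.default_rng(0)
M, sigma_y, q = 50, 0.1, 1e-4           # features, obs. noise, random-walk variance

# Random Fourier features approximating an RBF kernel with lengthscale ell
ell = 0.5
W = rng.normal(scale=1.0 / ell, size=(M, 1))
b = rng.uniform(0, 2 * np.pi, size=M)
phi = lambda x: np.sqrt(2.0 / M) * np.cos(W @ np.atleast_1d(x) + b)

mean, cov = np.zeros(M), np.eye(M)       # Gaussian posterior over basis weights
for t in range(500):
    x_t = rng.uniform(-3, 3)
    y_t = np.sin(3 * x_t + 0.01 * t) + sigma_y * rng.normal()   # drifting target
    f = phi(x_t)
    cov = cov + q * np.eye(M)                        # predict: random walk on weights
    s = f @ cov @ f + sigma_y**2                     # update: standard Kalman step
    k = cov @ f / s
    mean = mean + k * (y_t - f @ mean)
    cov = cov - np.outer(k, f @ cov)

print(float(phi(1.0) @ mean))   # prediction at x = 1 after streaming updates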

URL: https://openreview.net/forum?id=aVOzWH1Nc5

---

Title: IMEX-Reg: Implicit-Explicit Regularization in the Function Space for Continual Learning

Authors: Prashant Shivaram Bhat, Bharath Chennamkulam Renjith, Elahe Arani, Bahram Zonooz

Abstract: Continual learning (CL) remains one of the long-standing challenges for deep neural networks due to catastrophic forgetting of previously acquired knowledge. Although rehearsal-based approaches have been fairly successful in mitigating catastrophic forgetting, they suffer from overfitting on buffered samples and prior information loss, hindering generalization under low-buffer regimes. Inspired by how humans learn using strong inductive biases, we propose \textbf{IMEX-Reg} to improve the generalization performance of experience rehearsal in CL under low buffer regimes. Specifically, we employ a two-pronged implicit-explicit regularization approach using contrastive representation learning (CRL) and consistency regularization. To further leverage the global relationship between representations learned using CRL, we propose a regularization strategy to guide the classifier toward the activation correlations in the unit hypersphere of the CRL. Our results show that IMEX-Reg significantly improves generalization performance and outperforms rehearsal-based approaches in several CL scenarios. It is also robust to natural and adversarial corruptions with less task-recency bias. Additionally, we provide theoretical insights to support our design decisions further.

URL: https://openreview.net/forum?id=p1a6ruIZCT

---

Title: Improving Subgraph-GNNs via Edge-Level Ego-Network Encodings

Authors: Nurudin Alvarez-Gonzalez, Andreas Kaltenbrunner, Vicenç Gómez

Abstract: We present a novel edge-level ego-network encoding for learning on graphs that can boost Message Passing Graph Neural Networks (MP-GNNs) by providing additional node and edge features or extending message-passing formats. The proposed encoding is sufficient to distinguish Strongly Regular Graphs, a family of challenging 3-WL equivalent graphs. We show theoretically that such encoding is more expressive than node-based sub-graph MP-GNNs. In an empirical evaluation on four benchmarks with 10 graph datasets, our results match or improve previous baselines on expressivity, graph classification, graph regression, and proximity tasks---while reducing memory usage by 18.1x in certain real-world settings.

URL: https://openreview.net/forum?id=N0Sc0KY0AH

---

Title: DP-ImgSyn: Dataset Alignment for Obfuscated, Differentially Private Image Synthesis

Authors: Efstathia Soufleri, Deepak Ravikumar, Kaushik Roy

Abstract: The availability of abundant data has catalyzed the expansion of deep learning vision algorithms. However, certain vision datasets cannot be publicly released due to privacy reasons. Releasing synthetic images instead of private images is a common approach to overcome this issue. A popular method to generate synthetic images is using Generative Adversarial Networks (GANs) with Differential Privacy (DP) guarantees. However, GAN-generated synthetic images are visually similar to private images. This is a severe limitation, particularly when the private dataset depicts visually sensitive and disturbing content. To address this, we propose a non-generative framework, Differentially Private Image Synthesis (DP-ImgSyn), to generate and release synthetic images for image classification tasks. These synthetic images: (1) have DP guarantees, (2) retain the utility of the private images, i.e., a model trained using synthetic images results in similar accuracy as a model trained on private images, (3) the synthetic images are visually dissimilar to private images. DP-ImgSyn consists of the following steps: First, a teacher model is trained on the private images using a DP training algorithm. Second, public images are used as initialization for the synthetic images which are optimized to align them with the private images. The optimization uses the teacher network's batch normalization layer statistics (mean, standard deviation) to inject information about the private images into the synthetic images. Third, the synthetic images and their soft labels, obtained from the teacher model, are released and can be deployed for neural network training on image classification tasks. Our experiments on various image classification datasets show that when using similar DP training mechanisms, our framework performs better than generative techniques (up to $\approx$ 20% in terms of image classification accuracy).
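
The batch-normalization statistic alignment described above can be sketched as follows; the teacher here is an untrained placeholder (in DP-ImgSyn it would be DP-trained on private data), and the public-image initialization and soft-label release steps are omitted, so this only illustrates the statistics-matching loss:

import torch
import torch.nn as nn
import torchvision

# Placeholder "teacher"; in DP-ImgSyn it is trained on private data with DP.
teacher = torchvision.models.resnet18(num_classes=10).eval()

bn_losses = []
def bn_hook(module, inputs, output):
    x = inputs[0]
    mean = x.mean(dim=(0, 2, 3))
    var = x.var(dim=(0, 2, 3), unbiased=False)
    bn_losses.append(((mean - module.running_mean) ** 2).sum()
                     + ((var - module.running_var) ** 2).sum())

for m in teacher.modules():
    if isinstance(m, nn.BatchNorm2d):
        m.register_forward_hook(bn_hook)

# Synthetic images; DP-ImgSyn initializes these from public images (random here).
imgs = torch.randn(16, 3, 224, 224, requires_grad=True)
opt = torch.optim.Adam([imgs], lr=0.05)

for step in range(10):
    bn_losses.clear()
    opt.zero_grad()
    teacher(imgs)
    loss = torch.stack(bn_losses).sum()   # align batch stats with teacher BN stats
    loss.backward()
    opt.step()
print(loss.item())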

URL: https://openreview.net/forum?id=KleJZ9ZzYw

---

Title: Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers

Authors: Aleksandar Stanić, Sergi Caelles, Michael Tschannen

Abstract: Visual reasoning is dominated by end-to-end neural networks scaled to billions of model parameters and training examples. However, even the largest models struggle with compositional reasoning, generalization, fine-grained spatial and temporal reasoning, and counting. Visual reasoning with large language models (LLMs) as controllers can, in principle, address these limitations by decomposing the task and solving subtasks by orchestrating a set of (visual) tools. Recently, these models achieved great performance on tasks such as compositional visual question answering, visual grounding, and video temporal reasoning. Nevertheless, in their current form, these models heavily rely on human engineering of in-context examples in the prompt, which are often dataset- and task-specific and require significant labor by highly skilled programmers. In this work, we present a framework that mitigates these issues by introducing spatially and temporally abstract routines and by leveraging a small number of labeled examples to automatically generate in-context examples, thereby avoiding human-created in-context examples. On a number of visual reasoning tasks, we show that our framework leads to consistent gains in performance, makes LLMs as controllers setup more robust, and removes the need for human engineering of in-context examples.

URL: https://openreview.net/forum?id=WYGiqSVstK

---

Title: From Stability to Chaos: Analyzing Gradient Descent Dynamics in Quadratic Regression

Authors: Xuxing Chen, Krishna Balasubramanian, Promit Ghosal, Bhavya Kumar Agrawalla

Abstract: We conduct a comprehensive investigation into the dynamics of gradient descent using large-order constant step-sizes in the context of quadratic regression models. Within this framework, we reveal that the dynamics can be encapsulated by a specific cubic map, naturally parameterized by the step-size. Through a fine-grained bifurcation analysis concerning the step-size parameter, we delineate five distinct training phases: (1) monotonic, (2) catapult, (3) periodic, (4) chaotic, and (5) divergent, precisely demarcating the boundaries of each phase. As illustrations, we provide examples involving phase retrieval and two-layer neural networks employing quadratic activation functions and constant outer-layers, utilizing orthogonal training data. Our simulations indicate that these five phases also manifest with generic non-orthogonal data. We also empirically investigate the generalization performance when training in the various non-monotonic (and non-divergent) phases. In particular, we observe that performing an ergodic trajectory averaging stabilizes the test error in non-monotonic (and non-divergent) phases.
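
A small simulation in the spirit of the abstract: gradient descent on a phase-retrieval-style quadratic regression loss with a few constant step-sizes. Sweeping the step-size typically moves the trajectory between monotone, catapult-like, and divergent behaviour; the specific step-sizes below are illustrative, and the cubic-map analysis and exact phase boundaries from the paper are not reproduced:

import numpy as np

np.seterr(over="ignore", invalid="ignore")   # silence warnings in the divergent phase
rng = np.random.default_rng(0)
n, d = 20, 20
A = np.linalg.qr(rng.normal(size=(d, d)))[0][:n]    # orthogonal training data
x_star = rng.normal(size=d) / np.sqrt(d)
y = (A @ x_star) ** 2                               # quadratic (phase retrieval) targets

def loss_and_grad(x):
    r = (A @ x) ** 2 - y
    return 0.25 * np.mean(r ** 2), A.T @ (r * (A @ x)) / n

for eta in (0.5, 5.0, 50.0):                        # illustrative step-sizes
    x = rng.normal(size=d) * 0.5
    losses = []
    for _ in range(200):
        l, g = loss_and_grad(x)
        losses.append(l)
        x = x - eta * g
        if not np.isfinite(x).all():
            break
    print(f"eta={eta:5.1f}  first/last loss: {losses[0]:.3e} / {losses[-1]:.3e}")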

URL: https://openreview.net/forum?id=Wiklo5VpG7

---

Title: Using Skew to Assess the Quality of GAN-generated Image Features

Authors: Lorenzo Luzi, Helen Jenne, Carlos Ortiz Marrero, Ryan Murray

Abstract: The rapid advancement of Generative Adversarial Networks (GANs) necessitates robust evaluation of these models. Among the established evaluation criteria, the Fréchet Inception Distance (FID) has been widely adopted due to its conceptual simplicity, fast computation time, and strong correlation with human perception. However, FID has inherent limitations, mainly stemming from its assumption that feature embeddings follow a Gaussian distribution, and therefore can be defined by their first two moments. As this does not hold in practice, in this paper we explore the importance of third moments in image feature data and use this information to define a new measure, which we call the Skew Inception Distance (SID). We prove that SID is a pseudometric on probability distributions, show how it extends FID, and present a practical method for its computation. Our numerical experiments support that SID either tracks with FID or, in some cases, aligns more closely with human perception when evaluating image features of ImageNet data. Our work also shows that principal component analysis can be used to speed up the computation time of both FID and SID. Although we focus on using SID on image features for GAN evaluation, SID is applicable much more generally, including for the evaluation of other generative models.
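
For reference, the FID that SID extends compares only the first two moments of the feature embeddings; a minimal implementation is below. The third-moment term that defines SID is specified in the paper and is not reproduced here:

import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real, feats_gen):
    """Frechet Inception Distance between two sets of feature embeddings,
    under the Gaussian assumption the abstract describes."""
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    c1 = np.cov(feats_real, rowvar=False)
    c2 = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(c1 @ c2)
    if np.iscomplexobj(covmean):          # drop tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(c1 + c2 - 2 * covmean))

rng = np.random.default_rng(0)
print(fid(rng.normal(size=(1000, 64)), rng.normal(loc=0.3, size=(1000, 64))))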

URL: https://openreview.net/forum?id=Io3jDUC4DP

---

Title: Reproducibility Study of "Explaining RL Decisions with Trajectories"

Authors: Clio Feng, Colin Bot, Bart den Boef, Bart Aaldering

Abstract: This paper reports on a reproducibility study of the paper `Explaining RL Decisions with Trajectories' by Deshmukh et al. (2023). The authors proposed a method to elucidate the decisions of an offline RL agent by attributing them to clusters of trajectories encountered during training. The original paper explored various environments and conducted a human study to gauge real-world performance. Our objective is to validate the effectiveness of their proposed approach. We conducted quantitative and qualitative experiments across three environments: a Grid-world, an Atari video game (Seaquest), and a continuous control task from MuJoCo (HalfCheetah). While the authors provided the code for the Grid-world environment, we re-implemented it for the Seaquest and HalfCheetah environments. This work extends the original paper by including trajectory rankings within a cluster, experimenting with alternative trajectory clustering, and expanding the human study. The results affirm the effectiveness of the method, both in its reproduction and in the additional experiments. However, the results of the human study suggest that the method's explanations are more challenging to interpret for humans in more complex environments. Our implementations can be found on GitHub.

URL: https://openreview.net/forum?id=JQoWmeNaC2

---

Title: Identify Ambiguous Tasks Combining Crowdsourced Labels by Weighting Areas Under the Margin

Authors: Tanguy Lefort, Benjamin Charlier, Alexis Joly, Joseph Salmon

Abstract: In supervised learning — for instance in image classification — modern massive datasets are commonly labeled by a crowd of workers. The obtained labels in this crowdsourcing setting are then aggregated for training, generally leveraging a per-worker trust score. Yet, such worker-oriented approaches discard the tasks' ambiguity. Ambiguous tasks might fool expert workers, which is often harmful for the learning step. In standard supervised learning settings -- with one label per task -- the Area Under the Margin (AUM) was tailored to identify mislabeled data. We adapt the AUM to identify ambiguous tasks in crowdsourced learning scenarios, introducing the Weighted Areas Under the Margin (WAUM). The WAUM is an average of AUMs weighted according to task-dependent scores. We show that the WAUM can help discard ambiguous tasks from the training set, leading to better generalization performance. We report improvements over existing strategies for learning with a crowd, both in simulated settings and on real datasets such as CIFAR-10H (a crowdsourced dataset with a high number of answered labels), LabelMe and Music (two datasets with few answered votes).
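
A rough sketch of the quantities involved, under our reading of the abstract: the AUM of a (task, label) pair averages the assigned label's margin over training epochs, and the WAUM averages per-worker AUMs with task-dependent weights. The weights and logits below are placeholders, not the trust scores proposed in the paper:

import numpy as np

def aum(logits_over_epochs, label):
    """Area Under the Margin: mean over epochs of
    (logit of the assigned label) - (largest other logit)."""
    margins = []
    for logits in logits_over_epochs:
        others = np.delete(logits, label)
        margins.append(logits[label] - others.max())
    return float(np.mean(margins))

def waum(logits_over_epochs, worker_labels, worker_weights):
    """Weighted average of per-worker AUMs for one task (weights are illustrative)."""
    scores = np.array([aum(logits_over_epochs, y) for y in worker_labels])
    w = np.asarray(worker_weights, dtype=float)
    return float((w * scores).sum() / w.sum())

rng = np.random.default_rng(0)
epochs_logits = rng.normal(size=(10, 5))     # 10 epochs, 5 classes, one task
print(waum(epochs_logits, worker_labels=[2, 2, 3], worker_weights=[0.5, 0.3, 0.2]))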

URL: https://openreview.net/forum?id=raD846nj2q

---

Title: DSI2I: Dense Style for Unpaired Exemplar-based Image-to-Image Translation

Authors: Baran Ozaydin, Tong Zhang, Sabine Susstrunk, Mathieu Salzmann

Abstract: Unpaired exemplar-based image-to-image (UEI2I) translation aims to translate a source image to a target image domain with the style of a target image exemplar, without ground-truth input-translation pairs. Existing UEI2I methods represent style using one vector per image or rely on semantic supervision to define one style vector per object. Here, in contrast, we propose to represent style as a dense feature map, allowing for a finer-grained transfer to the source image without requiring any external semantic information. We then rely on perceptual and adversarial losses to disentangle our dense style and content representations. To stylize the source content with the exemplar style, we extract unsupervised cross-domain semantic correspondences and warp the exemplar style to the source content. We demonstrate the effectiveness of our method on four datasets using standard metrics together with a localized style metric we propose, which measures style similarity in a class-wise manner. Our results show that the translations produced by our approach are more diverse, preserve the source content better, and are closer to the exemplars when compared to the state-of-the-art methods.

URL: https://openreview.net/forum?id=mrJi5kdKA4

---

Title: A Survey of Temporal Credit Assignment in Deep Reinforcement Learning

Authors: Eduardo Pignatelli, Johan Ferret, Matthieu Geist, Thomas Mesnard, Hado van Hasselt, Laura Toni

Abstract: The Credit Assignment Problem (CAP) refers to the longstanding challenge of Reinforcement Learning agents to associate actions with their long-term consequences. Solving the CAP is a crucial step towards the successful deployment of RL in the real world since most decision problems provide feedback that is noisy, delayed, and with little or no information about the causes. These conditions make it hard to distinguish serendipitous outcomes from those caused by informed decision-making.
However, the mathematical nature of credit and the CAP remains poorly understood and defined.
In this survey, we review the state of the art of Temporal Credit Assignment (CA) in deep RL. We propose a unifying formalism for credit that enables equitable comparisons of state-of-the-art algorithms and improves our understanding of the trade-offs between the various methods. We cast the CAP as the problem of learning the influence of an action over an outcome from a finite amount of experience. We discuss the challenges posed by delayed effects, dilution, and a lack of action influence, and analyse how existing methods aim to address them. Finally, we survey the protocols to evaluate a credit assignment method and suggest ways to diagnose the sources of struggle for different methods.
Overall, this survey provides an overview of the field for new-entry practitioners and researchers, it offers a coherent perspective for scholars looking to expedite the starting stages of a new study on the CAP, and it suggests potential directions for future research.

URL: https://openreview.net/forum?id=bNtr6SLgZf

---

Title: Improving Diffusion Models for Scene Text Editing with Dual Encoders

Authors: Jiabao Ji, Guanhua Zhang, Zhaowen Wang, Bairu Hou, Zhifei Zhang, Brian L. Price, Shiyu Chang

Abstract: Scene text editing is a challenging task that involves modifying or inserting specified texts in an image while maintaining its natural and realistic appearance. Most previous approaches to this task rely on style-transfer models that crop out text regions and feed them into image transfer models, such as GANs. However, these methods are limited in their ability to change text style and are unable to insert texts into images. Recent advances in diffusion models have shown promise in overcoming these limitations with text-conditional image editing. However, our empirical analysis reveals that state-of-the-art diffusion models struggle with rendering correct text and controlling text style. To address these problems, we propose DIFFSTE to improve pre-trained diffusion models with a dual encoder design, which includes a character encoder for better text legibility and an instruction encoder for better style control. An instruction tuning framework is introduced to train our model to learn the mapping from the text instruction to the corresponding image with either the specified style or the style of the surrounding texts in the background. This training method further equips our model with zero-shot generalization to the following three scenarios: generating text with unseen font variation, e.g., italic and bold, mixing different fonts to construct a new font, and using more relaxed forms of natural language as the instructions to guide the generation task. We evaluate our approach on five datasets and demonstrate its superior performance in terms of text correctness, image naturalness, and style controllability.

URL: https://openreview.net/forum?id=yL15ys5swq

---

Title: Continuous U-Net: Faster, Greater and Noiseless

Authors: Chun-Wun Cheng, Christina Runkel, Lihao Liu, Raymond H. Chan, Carola-Bibiane Schönlieb, Angelica I Aviles-Rivero

Abstract: Image segmentation is a fundamental task in image analysis and clinical practice. The current state-of-the-art techniques are based on U-shape type encoder-decoder networks with skip connections called U-Net. Despite the powerful performance reported by existing U-Net type networks, they suffer from several major limitations. These issues include the hard-coding of the receptive field size, which compromises performance and computational cost, as well as the fact that they do not account for inherent noise in the data. They have problems associated with discrete layers, and do not offer any theoretical underpinning. In this work we introduce continuous U-Net, a novel family of networks for image segmentation. Firstly, continuous U-Net is a continuous deep neural network that introduces new dynamic blocks modelled by second order ordinary differential equations. Secondly, we provide theoretical guarantees for our network demonstrating faster convergence, higher robustness and less sensitivity to noise. Thirdly, we derive qualitative measures for tailor-made segmentation tasks. We demonstrate, through extensive numerical and visual results, that our model outperforms existing U-Net blocks for several medical image segmentation benchmarking datasets.

URL: https://openreview.net/forum?id=ongi2oe3Fr

---

Title: Decentralized Decoupled Training for Federated Long-Tailed Learning

Authors: Wenkai Yang, Deli Chen, Hao Zhou, Fandong Meng, Jie Zhou, Xu Sun

Abstract: In the real world, the data samples often follow a long-tailed distribution, which poses a great challenge for Federated Learning (FL). That is, when the data is decentralized and long-tailed, FL may produce a poorly-behaved global model that is severely biased towards the head classes with the majority of the training samples. To settle this issue, decoupled training has recently been introduced to FL. Decoupled training aims to re-balance the biased classifier after the normal instance-balanced training, and has achieved promising results in centralized long-tailed learning. Existing studies directly adopt the decoupled training idea on the server side by re-training the classifier on a set of pseudo features, due to the unavailability of a global balanced dataset in FL. Unfortunately, this practice restricts the capacity of decoupled training in federated long-tailed learning as the low-quality pseudo features lead to a sub-optimal classifier. In this work, motivated by the distributed characteristic of FL, we propose a decentralized decoupled training mechanism by leveraging the abundant real data stored locally on the clients. Specifically, we integrate the local real data with the global gradient prototypes to form local balanced datasets, and thus re-balance the classifier during the local training. Furthermore, we introduce a supplementary classifier in the training phase to help model the global data distribution, which addresses the problem of contradictory optimization goals caused by performing classifier re-balancing locally. Extensive experiments show that our method consistently outperforms the existing state-of-the-art methods in various settings. Our code is available at https://github.com/keven980716/Federated_Learning_Experiments.

URL: https://openreview.net/forum?id=hw7inQwRxB

---

Title: FedConv: Enhancing Convolutional Neural Networks for Handling Data Heterogeneity in Federated Learning

Authors: Peiran Xu, Zeyu Wang, Jieru Mei, Liangqiong Qu, Alan Yuille, Cihang Xie, Yuyin Zhou

Abstract: Federated learning (FL) is an emerging paradigm in machine learning, where a shared model is collaboratively learned using data from multiple devices to mitigate the risk of data leakage. While recent studies posit that Vision Transformer (ViT) outperforms Convolutional Neural Networks (CNNs) in addressing data heterogeneity in FL, the specific architectural components that underpin this advantage have yet to be elucidated. In this paper, we systematically investigate the impact of different architectural elements, such as activation functions and normalization layers, on the performance within heterogeneous FL. Through rigorous empirical analyses, we are able to offer the first-of-its-kind general guidance on micro-architecture design principles for heterogeneous FL.

Intriguingly, our findings indicate that with strategic architectural modifications, pure CNNs can achieve a level of robustness that either matches or even exceeds that of ViTs when handling heterogeneous data clients in FL. Additionally, our approach is compatible with existing FL techniques and delivers state-of-the-art solutions across a broad spectrum of FL benchmarks.

URL: https://openreview.net/forum?id=bzTfO4mURl

---

Title: Understanding Sparse Neural Networks from their Topology via Multipartite Graph Representations

Authors: Elia Cunegatti, Matteo Farina, Doina Bucur, Giovanni Iacca

Abstract: Pruning-at-Initialization (PaI) algorithms provide Sparse Neural Networks (SNNs) which are computationally more efficient than their dense counterparts, and try to avoid performance degradation. While much emphasis has been directed towards \emph{how} to prune, we still do not know \emph{what topological metrics} of the SNNs characterize \emph{good performance}. From prior work, we have layer-wise topological metrics by which SNN performance can be predicted: the Ramanujan-based metrics. To exploit these metrics, proper ways to represent network layers via Graph Encodings (GEs) are needed, with Bipartite Graph Encodings (BGEs) being the \emph{de-facto} standard at the current stage. Nevertheless, existing BGEs neglect the impact of the inputs, and do not characterize the SNN in an end-to-end manner. Additionally, thanks to a thorough study of the Ramanujan-based metrics, we discover that they are only as good as the \emph{layer-wise density} as performance predictors, when paired with BGEs. To close both gaps, we design a comprehensive topological analysis for SNNs with both linear and convolutional layers, via (i) a new input-aware Multipartite Graph Encoding (MGE) for SNNs and (ii) the design of new end-to-end topological metrics over the MGE. With these novelties, we show the following: (a) The proposed MGE allows us to extract topological metrics that are much better predictors of the accuracy drop than metrics computed from current input-agnostic BGEs; (b) Which metrics are important at different sparsity levels and for different architectures; (c) A mixture of our topological metrics can rank PaI algorithms more effectively than Ramanujan-based metrics.

URL: https://openreview.net/forum?id=Egb0tUZnOY

---

Title: Differential Equation Scaling Limits of Shaped and Unshaped Neural Networks

Authors: Mufan Bill Li, Mihai Nica

Abstract: Recent analyses of neural networks with shaped activations (i.e. the activation function is scaled as the network size grows) have led to scaling limits described by differential equations. However, these results do not a priori tell us anything about ``ordinary'' unshaped networks, where the activation is unchanged as the network size grows. In this article, we find a similar differential-equation-based asymptotic characterization for two types of unshaped networks.

Firstly, we show that the following two architectures converge to the same infinite-depth-and-width limit at initialization:
(i) a fully connected ResNet with a $d^{-1/2}$ factor on the residual branch, where $d$ is the network depth.
(ii) a multilayer perceptron (MLP) with depth $d \ll$ width $n$ and shaped ReLU activation at rate $d^{-1/2}$.

Secondly, for an unshaped MLP at initialization, we derive the first order asymptotic correction to the layerwise correlation. In particular, if $\rho_\ell$ is the correlation at layer $\ell$, then $q_t = \ell^2 (1 - \rho_\ell)$ with $t = \frac{\ell}{n}$ converges to an SDE with a singularity at $t=0$.

These results together provide a connection between shaped and unshaped network architectures, and open up the possibility of studying the effect of normalization methods and how they connect with shaping activation functions.

URL: https://openreview.net/forum?id=iRDwUXYsSJ

---

Title: Stochastic Direct Search Methods for Blind Resource Allocation

Authors: Juliette Achddou, Olivier Cappé, Aurélien Garivier

Abstract: Motivated by programmatic advertising optimization, we consider the task of sequentially allocating budget across a set of resources. At every time step, a feasible allocation is chosen and only a corresponding random return is observed. The goal is to maximize the cumulative expected sum of returns. This is a realistic model for budget allocation across subdivisions of marketing campaigns, with the objective of maximizing the number of conversions. We study direct search (also known as pattern search) methods for linearly constrained and derivative-free optimization in the presence of noise, which apply in particular to sequential budget allocation. These algorithms, which do not rely on hierarchical partitioning of the resource space, are easy to implement; they respect the operational constraints of resource allocation by avoiding evaluation outside of the feasible domain; and, they are also compatible with warm start by being (approximate) descent algorithms. However, they have not yet been analyzed from the perspective of cumulative regret. We show that direct search methods achieve finite regret in the deterministic and unconstrained case. In the presence of evaluation noise and linear constraints, we propose a simple extension of direct search that achieves a regret upper-bound of the order of $T^{2/3}$. We also propose an accelerated version of the algorithm, relying on repeated sequential testing, that significantly improves the practical behavior of the approach.
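
A bare-bones pattern (direct) search with noisy evaluations and a feasibility check, in the spirit of the methods analysed above; the budget-allocation objective, constraint set, and step-size schedule are toy choices, and the paper's accelerated variant based on repeated sequential testing is not shown:

import numpy as np

rng = np.random.default_rng(0)
dim = 4

def noisy_return(x):
    """Toy objective: concave 'conversions' per resource plus evaluation noise."""
    return float(np.sum(np.sqrt(x + 1e-9)) + 0.01 * rng.normal())

def feasible(x):
    return np.all(x >= 0) and x.sum() <= 1.0     # budget constraint

x = np.full(dim, 0.1)                            # feasible starting allocation
step = 0.2
best = noisy_return(x)
for t in range(300):
    improved = False
    for d in range(dim):
        for sign in (+1.0, -1.0):
            cand = x.copy()
            cand[d] += sign * step
            if not feasible(cand):               # never evaluate outside the domain
                continue
            val = noisy_return(cand)
            if val > best:
                x, best, improved = cand, val, True
    if not improved:
        step *= 0.5                              # shrink the pattern
print(x, best)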

URL: https://openreview.net/forum?id=m1OXBLH0dH

---

Title: Routers in Vision Mixture of Experts: An Empirical Study

Authors: Tianlin Liu, Mathieu Blondel, Carlos Riquelme Ruiz, Joan Puigcerver

Abstract: Mixture-of-Experts (MoE) models are a promising way to scale up model capacity without significantly increasing computational cost. A key component of MoEs is the router, which decides which subset of parameters (experts) processes which feature embeddings (tokens). In this paper, we present a comprehensive study of routers in MoEs for computer vision tasks. We introduce a unified MoE formulation that subsumes different MoEs with two parametric routing tensors. This formulation covers both sparse MoE, which uses a binary or hard assignment between experts and tokens, and soft MoE, which uses a soft assignment between experts and weighted combinations of tokens. Routers for sparse MoEs can be further grouped into two variants: Token Choice, which matches experts to each token, and Expert Choice, which matches tokens to each expert. We conduct head-to-head experiments with 6 different routers, including existing routers from prior work and new ones we introduce. We show that (i) many routers originally developed for language modeling can be adapted to perform strongly in vision tasks, (ii) in sparse MoE, Expert Choice routers generally outperform Token Choice routers, and (iii) soft MoEs generally outperform sparse MoEs with a fixed compute budget. These results provide new insights regarding the crucial role of routers in vision MoE models.
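
A toy illustration of the two sparse-routing variants named above, using plain top-1/top-k assignments; real vision MoE routers add capacity constraints, auxiliary load-balancing losses, and jointly learned experts, none of which appear here:

import numpy as np

rng = np.random.default_rng(0)
n_tokens, d, n_experts, capacity = 8, 16, 4, 2

tokens = rng.normal(size=(n_tokens, d))
W_gate = rng.normal(size=(d, n_experts))          # routing parameters
logits = tokens @ W_gate                          # (tokens, experts) affinities

# Token Choice: each token picks its best expert.
token_choice = logits.argmax(axis=1)

# Expert Choice: each expert picks its top-`capacity` tokens.
expert_choice = {e: np.argsort(-logits[:, e])[:capacity] for e in range(n_experts)}

print("token -> expert:", token_choice)
print("expert -> tokens:", expert_choice)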

URL: https://openreview.net/forum?id=aHk3vctnf1

---

Title: Sketch and shift: a robust decoder for compressive clustering

Authors: Ayoub Belhadji, Rémi Gribonval

Abstract: Compressive learning is an emerging approach to drastically reduce the memory footprint of large-scale learning, by first summarizing a large dataset into a low-dimensional sketch vector, and then decoding from this sketch the latent information needed for learning. In light of recent progress on information preservation guarantees for sketches based on random features, a major objective is to design easy-to-tune algorithms (called decoders) to robustly and efficiently extract this information. To address the underlying non-convex optimization problems, various heuristics have been proposed. In the case of compressive clustering, the standard heuristic is CL-OMPR, a variant of sliding Frank-Wolfe. Yet, CL-OMPR is hard to tune, and the examination of its robustness was overlooked.
In this work, we undertake a scrutinized examination of CL-OMPR to circumvent its limitations. In particular, we show how this algorithm can fail to recover the clusters even in advantageous scenarios. To gain insight, we show how the deficiencies of this algorithm can be attributed to optimization difficulties related to the structure of a correlation function appearing at core steps of the algorithm. To address these limitations, we propose an alternative decoder offering substantial improvements over CL-OMPR. Its design is notably inspired by the mean shift algorithm, a classic approach to detect the local maxima of kernel density estimators. The proposed algorithm can extract clustering information from a sketch of the MNIST dataset that is 10 times smaller than previously possible.

URL: https://openreview.net/forum?id=6rWuWbVmgz

---

Title: Inference from Real-World Sparse Measurements

Authors: Arnaud Pannatier, Kyle Matoba, François Fleuret

Abstract: Real-world problems often involve complex and unstructured sets of measurements, as occurs when sensors are sparsely placed in either space or time. Being able to model this irregular spatiotemporal data and extract meaningful forecasts is crucial. Deep learning architectures capable of processing sets of measurements with positions varying from set to set, and extracting readouts anywhere, are methodologically difficult to design. Current state-of-the-art models are graph neural networks and require domain-specific knowledge for proper setup.

We propose an attention-based model focused on robustness and practical applicability, with two key design contributions. First, we adopt a ViT-like transformer that takes both context points and read-out positions as inputs, eliminating the need for an encoder-decoder structure. Second, we use a unified method for encoding both context and read-out positions. This approach is intentionally straightforward and integrates well with other systems. Compared to existing approaches, our model is simpler, requires less specialized knowledge, and does not suffer from a problematic bottleneck effect, all of which contribute to superior performance.

We conduct in-depth ablation studies that characterize this problematic bottleneck in the latent representations of alternative models that inhibit information utilization and impede training efficiency. We also perform experiments across various problem domains, including high-altitude wind nowcasting, two-day weather forecasting, fluid dynamics, and heat diffusion. Our attention-based model consistently outperforms state-of-the-art models in handling irregularly sampled data. Notably, our model reduces the root mean square error (RMSE) for wind nowcasting from 9.24 to 7.98 and for heat diffusion tasks from 0.126 to 0.084.

URL: https://openreview.net/forum?id=y9IDfODRns

---

Title: What Has Been Overlooked in Contrastive Source-Free Domain Adaptation: Leveraging Source-Informed Latent Augmentation within Neighborhood Context

Authors: Jing Wang, Wonho Bae, Jiahong Chen, Kuangen Zhang, Leonid Sigal, Clarence W. de Silva

Abstract: Source-free domain adaptation (SFDA) involves adapting a model originally trained using a labeled dataset (source domain) to perform effectively on an unlabeled dataset (target domain) without relying on any source data during adaptation. This adaptation is especially crucial when significant disparities in data distributions exist between the two domains and when there are privacy concerns regarding the source model's training data. The absence of access to source data during adaptation makes it challenging to analytically estimate the domain gap. To tackle this issue, various techniques have been proposed, such as unsupervised clustering, contrastive learning, and continual learning. In this paper, we first conduct an extensive theoretical analysis of SFDA based on contrastive learning, primarily because it has demonstrated superior performance compared to other techniques. Motivated by the obtained insights, we then introduce a straightforward yet highly effective latent augmentation method tailored for contrastive SFDA. This augmentation method leverages the dispersion of latent features within the neighborhood of the query sample, guided by the source pre-trained model, to enhance the informativeness of positive keys. Our approach, based on a single InfoNCE-based contrastive loss, outperforms state-of-the-art SFDA methods on widely recognized benchmark datasets.

URL: https://openreview.net/forum?id=iulMde3dP1

---

Title: E-Valuating Classifier Two-Sample Tests

Authors: Teodora Pandeva, Tim Bakker, Christian A. Naesseth, Patrick Forré

Abstract: We introduce a powerful deep classifier two-sample test for high-dimensional data based on E-values, called E-C2ST. Our test combines ideas from existing work on split likelihood ratio tests and predictive independence tests. The resulting E-values are suitable for anytime-valid sequential two-sample tests. This feature allows for more effective use of data in constructing test statistics. Through simulations and real data applications, we empirically demonstrate that E-C2ST achieves enhanced statistical power by partitioning datasets into multiple batches, beyond the conventional two-split (training and testing) approach of standard two-sample classifier tests. This strategy increases the power of the test, while keeping the type I error well below the desired significance level.

URL: https://openreview.net/forum?id=dwFRov8xhr

---

Title: Semantic Positive Pairs for Enhancing Visual Representation Learning of Instance Discrimination methods

Authors: Mohammad Alkhalefi, Georgios Leontidis, Mingjun Zhong

Abstract: Self-supervised learning algorithms (SSL) based on instance discrimination have shown promising results, performing competitively or even outperforming supervised learning counterparts in some downstream tasks. Such approaches employ data augmentation to create two views of the same instance (i.e., positive pairs) and encourage the model to learn good representations by attracting these views closer in the embedding space without collapsing to the trivial solution. However, data augmentation is limited in representing positive pairs, and the repulsion process between the instances during contrastive learning may discard important features for instances that have similar categories. To address this issue, we propose an approach to identify those images with similar semantic content and treat them as positive instances, thereby reducing the chance of discarding important features during representation learning and increasing the richness of the latent representation. Our approach is generic and could work with any self-supervised instance discrimination frameworks such as MoCo and SimSiam. To evaluate our method, we run experiments on three benchmark datasets: ImageNet, STL-10 and CIFAR-10 with different instance discrimination SSL approaches. The experimental results show that our approach consistently outperforms the baseline methods across all three datasets; for instance, we improve upon the vanilla MoCo-v2 by 4.1% on ImageNet under a linear evaluation protocol over 800 epochs. We also report results on semi-supervised learning, transfer learning on downstream tasks, and object detection.

URL: https://openreview.net/forum?id=z5AXLMBWdU

---

Title: What do larger image classifiers memorise?

Authors: Michal Lukasik, Vaishnavh Nagarajan, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar

Abstract: The success of modern neural networks has prompted study of the connection between memorisation and generalisation: overparameterised models generalise well, despite being able to perfectly fit (“memorise”) completely random labels. To carefully study this issue, Feldman (2019) proposed a metric to quantify the degree of memorisation of individual training examples, and empirically computed the corresponding memorisation profile of a ResNet on image classification benchmarks. While an exciting first glimpse into what real-world models memorise, this leaves open a fundamental question: do larger neural models memorise more? This aligns with the common practice of training models of different sizes, each offering different cost-quality trade-offs: while larger models are typically observed to have higher quality, it is of interest to understand whether this is merely a consequence of them memorising larger numbers of input-output patterns. We present a comprehensive empirical analysis of this question on image classification benchmarks. We find that training examples exhibit an unexpectedly diverse set of memorisation trajectories across model sizes: most samples experience decreased memorisation under larger models, while the rest exhibit cap-shaped or increasing memorisation. We show that various proxies for the Feldman (2019) memorisation score fail to capture these fundamental trends. Lastly, we find that knowledge distillation — an effective and popular model compression technique — tends to inhibit memorisation, while also improving generalisation. Specifically, memorisation is mostly inhibited on examples with increasing memorisation trajectories, thus pointing at how distillation improves generalisation.
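
For context, the memorisation score of Feldman (2019) referenced above is, for a learning algorithm $\mathcal{A}$, training set $S$, and example $(x_i, y_i)$ (our recollection of the standard definition, not text from the paper):

$\mathrm{mem}(\mathcal{A}, S, i) = \Pr_{h \sim \mathcal{A}(S)}\left[h(x_i) = y_i\right] - \Pr_{h \sim \mathcal{A}(S \setminus \{(x_i, y_i)\})}\left[h(x_i) = y_i\right]$

i.e. how much more likely the trained model is to fit $(x_i, y_i)$ when that example is included in training.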

URL: https://openreview.net/forum?id=Ew73inSyhG

---

Title: Integrated Variational Fourier Features for Fast Spatial Modelling with Gaussian Processes

Authors: Talay M Cheema, Carl Edward Rasmussen

Abstract: Sparse variational approximations are popular methods for scaling up inference and learning in Gaussian processes to larger datasets. For $N$ training points, exact inference has $O(N^3)$ cost; with $M \ll N$ features, state-of-the-art sparse variational methods have $O(NM^2)$ cost. Recently, methods have been proposed using more sophisticated features; these promise $O(M^3)$ cost, with good performance in low dimensional tasks such as spatial modelling, but they only work with a very limited class of kernels, excluding some of the most commonly used. In this work, we propose integrated Fourier features, which extend these performance benefits to a very broad class of stationary covariance functions. We motivate the method and choice of parameters from a convergence analysis and empirical exploration, and show practical speedup in synthetic and real world spatial regression tasks.

URL: https://openreview.net/forum?id=PtBzWCaCYB

---

Title: Time Series Continuous Modeling for Imputation and Forecasting with Implicit Neural Representations

Authors: Etienne Le Naour, Louis Serrano, Léon Migus, Yuan Yin, Ghislain Agoua, Nicolas Baskiotis, patrick gallinari, Vincent Guigue

Abstract: We introduce a novel modeling approach for time series imputation and forecasting, tailored to address the challenges often encountered in real-world data, such as irregular samples, missing data, or unaligned measurements from multiple sensors. Our method relies on a continuous-time-dependent model of the series' evolution dynamics. It leverages adaptations of conditional, implicit neural representations for sequential data. A modulation mechanism, driven by a meta-learning algorithm, allows adaptation to unseen samples and extrapolation beyond observed time-windows for long-term predictions. The model provides a highly flexible and unified framework for imputation and forecasting tasks across a wide range of challenging scenarios. It achieves state-of-the-art performance on classical benchmarks and outperforms alternative time-continuous models.

URL: https://openreview.net/forum?id=P1vzXDklar

---

Title: Synthesizing Libraries of Programs with Auxiliary Functions

Authors: Habibur Rahman, Thirupathi Reddy Emireddy, Kenneth Tjhia, Elham Parhizkar, Levi Lelis

Abstract: A common approach to program synthesis is to use a learned function to guide the search for a program that satisfies the user's intent. In this paper, we propose a method that offers search guidance, through a domain-dependent auxiliary function, that can be orthogonal to the guidance previous functions provide. Our method, which we call Auxiliary-Based Library Learning (Aulile), searches for a solution in the program space using a base algorithm. If this search does not produce a solution, Aulile enhances the language with a library of programs discovered in the search that optimizes for the auxiliary function. Then, it repeats the search with this library-augmented language. This process is repeated until a solution is found or the system reaches a timeout. We evaluate Aulile in string manipulation tasks. Aulile improved, in some cases by a large margin, the performance of several base algorithms that use different search and learning strategies: Bus, Bustle, Crossbeam, and Bee Search. Our results suggest that Aulile offers an effective method of injecting domain knowledge into existing systems through a library learning scheme that optimizes for an auxiliary function.
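
A schematic of the loop described above, with the base search, the pool of discovered programs, and the auxiliary function passed in as hypothetical callables; the names and the selection rule are placeholders rather than the paper's exact procedure:

import time

def aulile(base_search, candidate_programs, auxiliary_score, library,
           timeout_s=60.0, library_size=5):
    """Repeat the base search, growing the language with programs that score
    best under the auxiliary function, until a solution is found or time runs out."""
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        solution = base_search(library)              # search in the augmented language
        if solution is not None:
            return solution, library
        seen = candidate_programs(library)           # programs discovered by the search
        best = sorted(seen, key=auxiliary_score, reverse=True)[:library_size]
        library = library + [p for p in best if p not in library]
    return None, library

# Trivial usage with stub callables (for illustration only):
sol, lib = aulile(base_search=lambda lib: None if len(lib) < 3 else "program",
                  candidate_programs=lambda lib: [f"frag{len(lib)}"],
                  auxiliary_score=len,
                  library=[],
                  timeout_s=1.0)
print(sol, lib)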

URL: https://openreview.net/forum?id=tP1PBrMUlX

---

Title: Choosing Wisely and Learning Deeply: Selective Cross-Modality Distillation via CLIP for Domain Generalization

Authors: Jixuan Leng, Yijiang Li, Haohan Wang

Abstract: Domain Generalization (DG), a crucial research area, seeks to train models across multiple domains and test them on unseen ones. In this paper, we introduce a novel approach, namely, Selective Cross-Modality Distillation for Domain Generalization (SCMD). SCMD leverages the capabilities of large vision-language models, specifically CLIP, to train a more efficient model, ensuring it acquires robust generalization capabilities across unseen domains.
Our primary contribution is a unique selection framework strategically designed to identify hard-to-learn samples for distillation.
In parallel, we introduce a novel cross-modality module that seamlessly combines the projected features of the student model with the text embeddings from CLIP, ensuring the alignment of similarity distributions.
We assess SCMD's performance on various benchmarks, where it empowers a ResNet50 to deliver state-of-the-art performance, surpassing existing domain generalization methods. Furthermore, we provide a theoretical analysis of our selection strategy, offering deeper insight into its effectiveness and potential in the field of DG.

URL: https://openreview.net/forum?id=4KLwep6mA1

---

Title: Anticipatory Music Transformer

Authors: John Thickstun, David Leo Wright Hall, Chris Donahue, Percy Liang

Abstract: We introduce anticipation: a method for constructing a controllable generative model of a temporal point process (the event process) conditioned asynchronously on realizations of a second, correlated process (the control process). We achieve this by interleaving sequences of events and controls, such that controls appear following stopping times in the event sequence. This work is motivated by problems arising in the control of symbolic music generation. We focus on infilling control tasks, whereby the controls are a subset of the events themselves, and conditional generation completes a sequence of events given the fixed control events. We train anticipatory infilling models using the large and diverse Lakh MIDI music dataset. These models match the performance of autoregressive models for prompted generation, with the additional capability to perform infilling control tasks, including accompaniment. Human evaluators report that, over 20-second clips, an anticipatory model produces accompaniments with musicality similar even to music composed by humans.

URL: https://openreview.net/forum?id=EBNJ33Fcrl

---

Title: Fooling Contrastive Language-Image Pre-Trained Models with CLIPMasterPrints

Authors: Matthias Freiberger, Peter Kun, Christian Igel, Anders Sundnes Løvlie, Sebastian Risi

Abstract: Models leveraging both visual and textual data, such as Contrastive Language-Image Pre-training (CLIP), are the backbone of many recent advances in artificial intelligence. In this work, we show that despite their versatility, such models are vulnerable to what we refer to as fooling master images. Fooling master images are capable of maximizing the confidence score of a CLIP model for a significant number of widely varying prompts, while being either unrecognizable or unrelated to the attacked prompts for humans. We demonstrate how fooling master images can be mined using stochastic gradient descent, projected gradient descent, or gradient-free optimisation. Contrary to many common adversarial attacks, the gradient-free optimisation approach allows us to mine fooling examples even when the weights of the model are not accessible. We investigate the properties of the mined fooling master images, and find that images mined using a small number of image captions potentially generalize to a much larger number of semantically related captions. Finally, we evaluate possible mitigation strategies and find that vulnerability to fooling master examples appears to be closely related to a modality gap in contrastive pre-trained multi-modal networks.

URL: https://openreview.net/forum?id=ZFZnvGXXMm

---

Title: A True-to-the-model Axiomatic Benchmark for Graph-based Explainers

Authors: Corrado Monti, Paolo Bajardi, Francesco Bonchi, André Panisson, Alan Perotti

Abstract: Regulators, researchers, and practitioners recognize the urgency of explainability in artificial intelligence systems, including the ones based on machine learning for graph-structured data. Despite the large number of proposals, however, a common understanding of what constitutes a good explanation is still lacking: different explainers often arrive at different conclusions on the same problem instance, making it hard for practitioners to choose among them. Furthermore, explainers often produce explanations through opaque logic that is hard to understand and assess -- ironically mirroring the black-box nature they aim to elucidate.

Recent proposals in the literature for benchmarking graph-based explainers typically involve embedding specific logic into data, training a black-box model, and then empirically assessing how well the explanation matches the embedded logic, i.e., they test truthfulness to the data. In contrast, we propose a true-to-the-model axiomatic framework for auditing explainers in the task of node classification on graphs.
Our proposal hinges on the fundamental idea that an explainer should discern if a model relies on a particular feature for classifying a node.
Building on this concept, we develop three types of white-box classifiers, with clear internal logic, that are relevant in real-world applications. We then formally prove that the set of features that can induce a change in the classification correctly corresponds to a ground-truth set of predefined important features. This property allows us to use the white-box classifiers to build a testing framework.

We apply this framework to both synthetic and real data and evaluate various state-of-the-art explainers, thus characterizing their behavior. Our findings highlight how explainers often react in a rather counter-intuitive fashion to technical details that might be easily overlooked. Our approach offers valuable insights and recommended practices for selecting the right explainer given the task at hand, and for developing new methods for explaining graph-learning models.

URL: https://openreview.net/forum?id=HSQTv3R8Iz

---

Title: Variance-aware decision making with linear function approximation under heavy-tailed rewards

Authors: Xiang Li, Qiang Sun

Abstract: This paper studies how to achieve variance-aware regrets for online decision-making in the presence of heavy-tailed rewards with only finite variances. For linear stochastic bandits, we address the issue of heavy-tailed rewards by modifying the adaptive Huber regression and proposing AdaOFUL. AdaOFUL achieves a state-of-the-art regret bound of $\widetilde{\mathcal{O}}\big(d\big(\sum_{t=1}^T \nu_{t}^2\big)^{1/2}+d\big)$ as if the rewards were uniformly bounded, where $\nu_{t}^2$ is the conditional variance of the reward at round $t$, $d$ is the feature dimension, and $T$ is the number of online rounds. Building upon AdaOFUL, we propose VARA for linear MDPs, which achieves a variance-aware regret bound of $\widetilde{\mathcal{O}}(d\sqrt{H\mathcal{G}^*K})$. Here, $H$ is the length of episodes, $K$ is the number of episodes, and $\mathcal{G}^*$ is a smaller instance-dependent quantity that can be bounded by other instance-dependent quantities when additional structural conditions on the MDP are satisfied. Overall, our modified adaptive Huber regression algorithm may serve as a useful building block in the design of algorithms for online problems with heavy-tailed rewards.

URL: https://openreview.net/forum?id=8bnsoL2IyJ

---

Title: Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models

Authors: Avi Singh, John D Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Xavier Garcia, Peter J Liu, James Harrison, Jaehoon Lee, Kelvin Xu, Aaron T Parisi, Abhishek Kumar, Alexander A Alemi, Alex Rizkowsky, Azade Nova, Ben Adlam, Bernd Bohnet, Gamaleldin Fathy Elsayed, Hanie Sedghi, Igor Mordatch, Isabelle Simpson, Izzeddin Gur, Jasper Snoek, Jeffrey Pennington, Jiri Hron, Kathleen Kenealy, Kevin Swersky, Kshiteej Mahajan, Laura A Culp, Lechao Xiao, Maxwell Bileschi, Noah Constant, Roman Novak, Rosanne Liu, Tris Warkentin, Yamini Bansal, Ethan Dyer, Behnam Neyshabur, Jascha Sohl-Dickstein, Noah Fiedel

Abstract: Fine-tuning language models (LMs) on human-generated data remains a prevalent practice. However, the performance of such models is often limited by the quantity and diversity of high-quality human data. In this paper, we explore whether we can go beyond human data on tasks where we have access to scalar feedback, for example, on math problems where one can verify correctness. To do so, we investigate a simple self-training method based on expectation-maximization, which we call ReST$^{EM}$, where we (1) generate samples from the model and filter them using binary feedback, (2) fine-tune the model on these samples, and (3) repeat this process a few times. Testing on advanced MATH reasoning and APPS coding benchmarks using PaLM-2 models, we find that ReST$^{EM}$ scales favorably with model size and significantly surpasses fine-tuning only on human data. Overall, our findings suggest self-training with feedback can reduce dependence on human-generated data.
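A minimal sketch of the generate-filter-fine-tune loop described in the abstract, assuming hypothetical helpers `generate`, `is_correct`, and `finetune`; it is illustrative only, not the authors' implementation.

```python
def self_train(model, problems, generate, is_correct, finetune,
               n_iterations=3, samples_per_problem=32):
    """Sketch of the expectation-maximization-style self-training loop described
    above: generate samples, filter with binary feedback, fine-tune, repeat.

    Assumed helpers (placeholders, not from the paper):
    - generate(model, problem, n) -> list of candidate solutions
    - is_correct(problem, solution) -> bool (e.g. answer checking or unit tests)
    - finetune(model, examples) -> model fine-tuned on (problem, solution) pairs
    """
    for _ in range(n_iterations):
        kept = []
        for problem in problems:
            for solution in generate(model, problem, samples_per_problem):
                if is_correct(problem, solution):   # binary feedback filter
                    kept.append((problem, solution))
        model = finetune(model, kept)               # fine-tune on filtered samples
    return model
```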

URL: https://openreview.net/forum?id=lNAyUngGFK

---

Title: ZigZag: Universal Sampling-free Uncertainty Estimation Through Two-Step Inference

Authors: Nikita Durasov, Nik Dorndorf, Hieu Le, Pascal Fua

Abstract: Whereas the ability of deep networks to produce useful predictions on many kinds of data has been amply demonstrated, estimating the reliability of these predictions remains challenging. Sampling approaches such as MC-Dropout and Deep Ensembles have emerged as the most popular ones for this purpose. Unfortunately, they require many forward passes at inference time, which slows them down. Sampling-free approaches can be faster but often suffer from other drawbacks, such as lower reliability of uncertainty estimates, difficulty of use, and limited applicability to different types of tasks and data.

In this work, we introduce a sampling-free approach that is generic and easy to deploy, while producing reliable uncertainty estimates on par with state-of-the-art methods at a significantly lower computational cost. It is predicated on training the network to produce the same output with and without additional information about it. At inference time, when no prior information is given, we use the network's own prediction as the additional information. We then take the distance between the predictions with and without prior information as our uncertainty measure.

We demonstrate our approach on several classification and regression tasks. We show that it delivers results on par with those of Ensembles but at a much lower computational cost.
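A rough sketch of the two-step inference idea, assuming a model that was trained to accept an optional extra input (here called `hint`) and a placeholder `null_hint` encoding "no prior information"; this illustrates the described mechanism and is not the authors' code.

```python
import torch

def zigzag_uncertainty(model, x, null_hint):
    """Two-step inference as described above, assuming `model(x, hint)` was trained
    to produce the same output with and without the extra `hint` input."""
    with torch.no_grad():
        y0 = model(x, null_hint)   # first pass: no prior information
        y1 = model(x, y0)          # second pass: feed the prediction back in
    # The disagreement between the two passes serves as the uncertainty measure.
    return y0, (y1 - y0).norm(dim=-1)
```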

URL: https://openreview.net/forum?id=QSvb6jBXML

---

Title: On the Optimization and Generalization of Multi-head Attention

Authors: Puneesh Deora, Rouzbeh Ghaderi, Hossein Taheri, Christos Thrampoulidis

Abstract: The training and generalization dynamics of the Transformer's core mechanism, namely the Attention mechanism, remain under-explored. Besides, existing analyses primarily focus on single-head attention. Inspired by the demonstrated benefits of overparameterization when training fully-connected networks, we investigate the potential optimization and generalization advantages of using multiple attention heads. Towards this goal, we derive convergence and generalization guarantees for gradient-descent training of a single-layer multi-head self-attention model, under a suitable realizability condition on the data. We then establish primitive conditions on the initialization that ensure realizability holds. Finally, we demonstrate that these conditions are satisfied for a simple tokenized-mixture model. We expect the analysis can be extended to various data-model and architecture variations.

URL: https://openreview.net/forum?id=wTGjn7JvYK

---

Title: Personalised Federated Learning On Heterogeneous Feature Spaces

Authors: Alain Rakotomamonjy, Maxime Vono, Hamlet Jesse Medina Ruiz, Liva Ralaivola

Abstract: Personalised federated learning (FL) approaches assume that raw data of all clients are defined in a common space, \emph{i.e.}, all clients store their data according to the same schema. For real-world applications, this assumption is restrictive as clients, having their own systems to collect and then store data, may use {\em heterogeneous} data representations. To bridge the gap between the assumption of a shared subspace and the more realistic situation of client-specific spaces, we propose a general framework coined FLIC that maps clients' data onto a common feature space via local embedding functions, in a federated manner. Preservation of class information in the latent space is ensured by a distribution alignment with respect to a learned reference distribution. We provide the algorithmic details of FLIC as well as theoretical insights supporting the relevance of our methodology. We compare its performance against FL benchmarks involving heterogeneous input feature spaces. Notably, we are the first to present a successful application of FL to Brain-Computer Interface signals acquired with different numbers of sensors.

URL: https://openreview.net/forum?id=uCZJaqJchs

---

Title: MUBen: Benchmarking the Uncertainty of Molecular Representation Models

Authors: Yinghao Li, Lingkai Kong, Yuanqi Du, Yue Yu, Yuchen Zhuang, Wenhao Mu, Chao Zhang

Abstract: Large molecular representation models pre-trained on massive unlabeled data have shown great success in predicting molecular properties. However, these models may tend to overfit the fine-tuning data, resulting in over-confident predictions on test data that fall outside of the training distribution. To address this issue, uncertainty quantification (UQ) methods can be used to improve the models' calibration of predictions. Although many UQ approaches exist, not all of them lead to improved performance. While some studies have included UQ to improve molecular pre-trained models, the process of selecting suitable backbone and UQ methods for reliable molecular uncertainty estimation remains underexplored. To address this gap, we present MUBen, which evaluates different UQ methods for state-of-the-art backbone molecular representation models to investigate their capabilities. By fine-tuning various backbones using different molecular descriptors as inputs with UQ methods from different categories, we assess the influence of architectural decisions and training strategies on property prediction and uncertainty estimation. Our study offers insights for selecting UQ for backbone models, which can facilitate research on uncertainty-critical applications in fields such as materials science and drug discovery.

URL: https://openreview.net/forum?id=qYceFeHgm4

---

Title: New Guarantees for Learning Revenue Maximizing Menus of Lotteries and Two-Part Tariffs

Authors: Maria Florina Balcan, Hedyeh Beyhaghi

Abstract: We advance a recently flourishing line of work at the intersection of learning theory and computational economics by studying the learnability of two classes of mechanisms prominent in economics, namely menus of lotteries and two-part tariffs. The former is a family of randomized mechanisms designed for selling multiple items, known to achieve revenue beyond deterministic mechanisms, while the latter is designed for selling multiple units (copies) of a single item with applications in real-world scenarios such as car or bike-sharing services. We focus on learning high-revenue mechanisms of this form from buyer valuation data in both distributional settings, where we have access to buyers’ valuation samples up-front, and the more challenging and less-studied online settings, where buyers arrive one-at-a-time and no distributional assumption is made about their values. We provide a suite of results with regard to these two families of mechanisms. We provide the first online learning algorithms for menus of lotteries and two-part tariffs with strong regret-bound guarantees. Since the space of parameters is infinite and the revenue functions have discontinuities, the known techniques do not readily apply. However, we are able to provide a reduction to online learning over a finite number of experts, in our case, a finite number of parameters. Furthermore, in the limited buyer-type case, we show a reduction to online linear optimization, which allows us to obtain no-regret guarantees by presenting buyers with menus that correspond to a barycentric spanner. In addition, we provide algorithms with improved running times over prior work for the distributional settings. Finally, we demonstrate how techniques from the recent literature in data-driven algorithm design are insufficient for our studied problems.

URL: https://openreview.net/forum?id=mhawjZcmrJ

---

Title: GSURE-Based Diffusion Model Training with Corrupted Data

Authors: Bahjat Kawar, Noam Elata, Tomer Michaeli, Michael Elad

Abstract: Diffusion models have demonstrated impressive results in both data generation and downstream tasks such as inverse problems, text-based editing, classification, and more. However, training such models usually requires large amounts of clean signals which are often difficult or impossible to obtain. In this work, we propose a novel training technique for generative diffusion models based only on corrupted data. We introduce a loss function based on the Generalized Stein’s Unbiased Risk Estimator (GSURE), and prove that under some conditions, it is equivalent to the training objective used in fully supervised diffusion models. We demonstrate our technique on face images as well as Magnetic Resonance Imaging (MRI), where the use of undersampled data significantly alleviates data collection costs. Our approach achieves generative performance comparable to its fully supervised counterpart without training on any clean signals. In addition, we deploy the resulting diffusion model in various downstream tasks beyond the degradation present in the training set, showcasing promising results.

URL: https://openreview.net/forum?id=BRl7fqMwaJ

---

Title: A note on regularised NTK dynamics with an application to PAC-Bayesian training

Authors: Eugenio Clerico, Benjamin Guedj

Abstract: We establish explicit dynamics for neural networks whose training objective has a regularising term that constrains the parameters to remain close to their initial value. This keeps the network in a lazy training regime, where the dynamics can be linearised around the initialisation. The standard neural tangent kernel (NTK) governs the evolution during the training in the infinite-width limit, although the regularisation yields an additional term in the differential equation describing the dynamics. This setting provides an appropriate framework to study the evolution of wide networks trained to optimise generalisation objectives such as PAC-Bayes bounds, and hence contributes to a deeper theoretical understanding of such networks.

URL: https://openreview.net/forum?id=2la55BeWwy

---

Title: Reproducibility Study of "Robust Fair Clustering: A Novel Fairness Attack and Defense Framework"

Authors: Iason Skylitsis, Zheng Feng, Idries Nasim, Camille Niessink

Abstract: Clustering algorithms play a pivotal role in various societal applications, where fairness is paramount to prevent adverse impacts on individuals. In this study, we revisit the robustness of fair clustering algorithms against adversarial attacks, affirming previous research findings that highlighted their susceptibility and the resilience of the Consensus Fair Clustering (CFC) model. Beyond reproducing these critical results, our work extends the original analysis by refining the codebase for enhanced experimentation, introducing additional metrics and datasets to deepen the evaluation of fairness and clustering quality, and exploring novel attack strategies, including targeted attacks on new metrics and a combined attack on balance and entropy, alongside an ablation study. These contributions validate the original claims about the vulnerability and resilience of fair clustering algorithms and broaden the research landscape by offering a more comprehensive toolkit for assessing adversarial robustness in fair clustering.

URL: https://openreview.net/forum?id=H1hLNjwrGy

---

Title: New Evaluation Metrics Capture Quality Degradation due to LLM Watermarking

Authors: Karanpartap Singh, James Zou

Abstract: With the increasing use of large-language models (LLMs) like ChatGPT, watermarking has emerged as a promising approach for tracing machine-generated content. However, research on LLM watermarking often relies on simple perplexity or diversity-based measures to assess the quality of watermarked text, which can mask important limitations in watermarking. Here we introduce two new easy-to-use methods for evaluating watermarking algorithms for LLMs: 1) evaluation by LLM-judger with specific guidelines; and 2) binary classification on text embeddings to distinguish between watermarked and unwatermarked text. We apply these methods to characterize the effectiveness of current watermarking techniques. Our experiments, conducted across various datasets, reveal that current watermarking methods are moderately detectable by even simple classifiers, challenging the notion of watermarking subtlety. We also found, through the LLM judger, that watermarking impacts text quality, especially in degrading the coherence and depth of the response. Our findings underscore the trade-off between watermark robustness and text quality and highlight the importance of having more informative metrics to assess watermarking quality.
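As an illustration of the second evaluation method (binary classification on text embeddings), here is a minimal sketch using scikit-learn; how the embeddings are computed is left open, and this is not the authors' exact protocol.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def watermark_detectability(emb_watermarked, emb_unwatermarked):
    """Fit a simple binary classifier on precomputed text embeddings (arrays of
    shape (n, d)) and report its cross-validated accuracy. Accuracy near chance
    suggests the watermark is hard to detect from embeddings alone."""
    X = np.vstack([emb_watermarked, emb_unwatermarked])
    y = np.concatenate([np.ones(len(emb_watermarked)),
                        np.zeros(len(emb_unwatermarked))])
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, y, cv=5).mean()
```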

URL: https://openreview.net/forum?id=PuhF0hyDq1

---

Title: Does Representation Similarity Capture Function Similarity?

Authors: Lucas Hayne, Heejung Jung, R. Carter

Abstract: Representation similarity metrics are widely used to compare learned representations in neural networks, as is evident in extensive literature investigating metrics that accurately capture information encoded in representations. However, aiming to capture all of the information available in representations may have little to do with what information is actually used by the downstream network. One solution is to experiment with interventions on network function. By ablating groups of units thought to carry information and observing whether those ablations affect network performance, we can focus on an outcome that mechanistically links representations to function. In this paper, we systematically test representation similarity metrics to evaluate their sensitivity to functional changes induced by ablation. We use network performance changes after ablation as a way to measure the influence of representation on function. These measures of function allow us to test how well similarity metrics capture changes in network performance versus changes to linear decodability. Network performance measures index the information used by the downstream network, while linear decoding methods index available information in the representation. We show that all of the tested metrics are more sensitive to decodable features than network performance. When comparing these metrics, Procrustes and CKA outperform regularized CCA-based methods on average. Although Procrustes and CKA outperform on average, these metrics have a diminished advantage when looking at network performance. We provide ablation tests of the utility of different representational similarity metrics. Our results suggest that interpretability methods will be more effective if they are based on representational similarity metrics that have been evaluated using ablation tests.
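For reference, one of the similarity metrics compared in the paper, linear CKA, can be computed as follows (a standard reference implementation shown only for orientation, not the authors' code):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two representation matrices of
    shape (n_examples, n_features). Values close to 1 indicate highly similar
    representations up to linear transformation."""
    X = X - X.mean(axis=0)   # center each feature
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, 'fro') ** 2
    norm_x = np.linalg.norm(X.T @ X, 'fro')
    norm_y = np.linalg.norm(Y.T @ Y, 'fro')
    return hsic / (norm_x * norm_y)
```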

URL: https://openreview.net/forum?id=YY2iA0hfia

---

Title: Continual Diffusion: Continual Customization of Text-to-Image Diffusion with C-LoRA

Authors: James Seale Smith, Yen-Chang Hsu, Lingyu Zhang, Ting Hua, Zsolt Kira, Yilin Shen, Hongxia Jin

Abstract: Recent works demonstrate a remarkable ability to customize text-to-image diffusion models while only providing a few example images. What happens if you try to customize such models using multiple, fine-grained concepts in a sequential (i.e., continual) manner? In our work, we show that recent state-of-the-art customization of text-to-image models suffers from catastrophic forgetting when new concepts arrive sequentially. Specifically, when adding a new concept, the ability to generate high-quality images of past, similar concepts degrades. To circumvent this forgetting, we propose a new method, C-LoRA, composed of a continually self-regularized low-rank adaptation in cross-attention layers of the popular Stable Diffusion model. Furthermore, we use customization prompts which do not include the word of the customized object (i.e., "person" for a human face dataset) and are initialized as completely random embeddings. Importantly, our method induces only marginal additional parameter costs and requires no storage of user data for replay. We show that C-LoRA not only outperforms several baselines for our proposed setting of text-to-image continual customization, which we refer to as Continual Diffusion, but that we achieve a new state-of-the-art in the well-established rehearsal-free continual learning setting for image classification. The strong performance of C-LoRA in two separate domains positions it as a compelling solution for a wide range of applications, and we believe it has significant potential for practical impact.

URL: https://openreview.net/forum?id=TZdEgwZ6f3

---

Title: BP($\mathbf{\lambda}$): Online Learning via Synthetic Gradients

Authors: Joseph Oliver Pemberton, Rui Ponte Costa

Abstract: Training recurrent neural networks typically relies on backpropagation through time (BPTT). BPTT requires the forward and backward passes to be completed, locking the network to these computations before loss gradients are available. Recently, Jaderberg et al. proposed synthetic gradients to alleviate the need for full BPTT. In their implementation, synthetic gradients are learned through a mixture of backpropagated gradients and bootstrapped synthetic gradients, analogous to the temporal difference (TD) algorithm in Reinforcement Learning (RL). However, as in TD learning, heavy use of bootstrapping can result in bias which leads to poor synthetic gradient estimates. Inspired by the accumulate $\mathrm{TD}(\lambda)$ in RL, we propose a fully online method for learning synthetic gradients which avoids the use of BPTT altogether: \emph{accumulate} $\mathrm{BP}(\lambda)$. As in accumulate $\mathrm{TD}(\lambda)$, we show analytically that accumulate $\mathrm{BP}(\lambda)$ can control the level of bias by using a mixture of temporal difference errors and recursively defined eligibility traces. We next demonstrate empirically that our model outperforms the original implementation for learning synthetic gradients in a variety of tasks, and is particularly suited for capturing longer timescales. Finally, building on recent work we reflect on accumulate $\mathrm{BP}(\lambda)$ as a principle for learning in biological circuits. In summary, inspired by RL principles we introduce an algorithm capable of bias-free online learning via synthetic gradients.
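For readers unfamiliar with the RL algorithm the analogy is drawn from, below is a minimal tabular sketch of accumulate TD(λ) with eligibility traces; it is not the BP(λ) synthetic-gradient rule itself, only the reinforcement-learning counterpart the abstract refers to.

```python
import numpy as np

def accumulate_td_lambda(transitions, n_states, alpha=0.1, gamma=0.99, lam=0.9):
    """Tabular accumulate TD(lambda) value estimation with eligibility traces.
    `transitions` is an iterable of (state, reward, next_state, done) tuples."""
    V = np.zeros(n_states)
    e = np.zeros(n_states)
    for s, r, s_next, done in transitions:
        target = r + (0.0 if done else gamma * V[s_next])
        delta = target - V[s]      # temporal-difference error
        e[s] += 1.0                # accumulating eligibility trace
        V += alpha * delta * e     # credit all recently visited states
        e *= gamma * lam           # decay the traces
        if done:
            e[:] = 0.0             # reset traces at episode boundaries
    return V
```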

URL: https://openreview.net/forum?id=3kYgouAfqk

---

Title: GUARD: A Safe Reinforcement Learning Benchmark

Authors: Weiye Zhao, Yifan Sun, Feihan Li, Rui Chen, Ruixuan Liu, Tianhao Wei, Changliu Liu

Abstract: Due to the trial-and-error nature of RL, it is typically challenging to apply RL algorithms to safety-critical real-world applications, such as autonomous driving, human-robot interaction, robot manipulation, etc., where such errors are not tolerable. Recently, safe RL (i.e. constrained RL) has emerged rapidly in the literature, in which the agents explore the environment while satisfying constraints. Due to the diversity of algorithms and tasks, it remains difficult to compare existing safe RL algorithms. To fill that gap, we introduce GUARD, a Generalized Unified SAfe Reinforcement Learning Development Benchmark. GUARD has several advantages compared to existing benchmarks. First, GUARD is a generalized benchmark with a wide variety of RL agents, tasks, and safety constraint specifications. Second, GUARD comprehensively covers state-of-the-art safe RL algorithms with self-contained implementations. Third, GUARD is highly customizable in tasks and algorithms. We present a comparison of state-of-the-art on-policy safe RL algorithms in various task settings using GUARD and establish baselines that future work can build on.

URL: https://openreview.net/forum?id=kZFKwApeQO

---

Title: Incremental Extractive Opinion Summarization Using Cover Trees

Authors: Somnath Basu Roy Chowdhury, Nicholas Monath, Kumar Avinava Dubey, Manzil Zaheer, Andrew McCallum, Amr Ahmed, Snigdha Chaturvedi

Abstract: Extractive opinion summarization involves automatically producing a summary of text about an entity (e.g., a product’s reviews) by extracting representative sentences that capture prevalent opinions in the review set. Typically, in online marketplaces user reviews accrue over time, and opinion summaries must be updated periodically to provide customers with up-to-date information. In this work, we study the task of extractive opinion summarization in an incremental setting, where the underlying review set evolves over time. Many of the state-of-the-art extractive opinion summarization approaches are centrality-based, such as CentroidRank (Radev et al., 2004; Chowdhury et al., 2022). CentroidRank performs extractive summarization by selecting a subset of review sentences closest to the centroid in the representation space as the summary. However, these methods are not capable of operating efficiently in an incremental setting, where reviews arrive one at a time. In this paper, we present an efficient algorithm for accurately computing the CentroidRank summaries in an incremental setting. Our approach, CoverSumm, relies on indexing review representations in a cover tree and maintaining a reservoir of candidate summary review sentences. CoverSumm’s efficacy is supported by a theoretical and empirical analysis of running time. Empirically, on a diverse collection of data (both real and synthetically created to illustrate scaling considerations), we demonstrate that CoverSumm is up to 25x faster than baseline methods, and capable of adapting to nuanced changes in data distribution. We also conduct human evaluations of the generated summaries and find that CoverSumm is capable of producing informative summaries consistent with the underlying review set.
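To make the centroid-based selection concrete, below is a naive incremental sketch of the CentroidRank idea (brute-force nearest-centroid selection); CoverSumm replaces the brute-force step with a cover-tree index and a candidate reservoir, which is not shown here.

```python
import numpy as np

class IncrementalCentroidSummary:
    """Naive incremental CentroidRank: keep a running set of review-sentence
    embeddings and return the k sentences closest to their centroid."""
    def __init__(self, k=5):
        self.k = k
        self.embeddings = []   # one vector per review sentence

    def add(self, embedding):
        """Called whenever a new review sentence arrives."""
        self.embeddings.append(np.asarray(embedding, dtype=float))

    def summary_indices(self):
        X = np.stack(self.embeddings)
        centroid = X.mean(axis=0)
        dists = np.linalg.norm(X - centroid, axis=1)
        return np.argsort(dists)[:self.k]   # indices of the summary sentences
```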

URL: https://openreview.net/forum?id=IzmLJ1t49R

---

Title: Persistent Local Homology in Graph Learning

Authors: Minghua Wang, Yan HU, Ziyun Huang, Di Wang, Jinhui Xu

Abstract: In this study, we introduce Persistent Local Homology (PLH) for graphs, a novel method that synergizes persistent homology with local homology to analyze graph structures. We begin by mathematically formalizing PLH, defining it as the application of persistent homology to annular local subgraphs. This foundation paves the way for the development of a computational pipeline, specifically tailored for PLH, which we explore in various graph learning contexts. Despite its utility, a complexity analysis reveals potential computational bottlenecks in PLH application. To address this, we propose Reduced PLH (rPLH), an efficient variant designed to significantly lower computational complexity. Experimental evaluations with rPLH demonstrate its capability to retain the effectiveness of the original PLH while substantially reducing computational demands. The practical utility of PLH and rPLH is further corroborated through comprehensive experiments on both synthetic and real-world datasets, highlighting their broad applicability and potential in diverse analytical scenarios.

URL: https://openreview.net/forum?id=qunyX9WYr6

---

Title: Adversarially Robust Spiking Neural Networks Through Conversion

Authors: Ozan Ozdenizci, Robert Legenstein

Abstract: Spiking neural networks (SNNs) provide an energy-efficient alternative to a variety of artificial neural network (ANN) based AI applications. As the progress in neuromorphic computing with SNNs expands their use in applications, the problem of adversarial robustness of SNNs becomes more pronounced. In contrast to the widely explored end-to-end adversarial training based solutions, we address the limited progress in scalable robust SNN training methods by proposing an adversarially robust ANN-to-SNN conversion algorithm. Our method provides an efficient approach to embrace various computationally demanding robust learning objectives that have been proposed for ANNs. During a post-conversion robust finetuning phase, our method adversarially optimizes both layer-wise firing thresholds and synaptic connectivity weights of the SNN to maintain transferred robustness gains from the pre-trained ANN. We perform experimental evaluations in a novel setting proposed to rigorously assess the robustness of SNNs, where numerous adaptive adversarial attacks that account for the spike-based operation dynamics are considered. Results show that our approach yields a scalable state-of-the-art solution for adversarially robust deep SNNs with low latency.

URL: https://openreview.net/forum?id=I8FMYa2BdP

---

Title: Continual Learning: Applications and the Road Forward

Authors: Eli Verwimp, Rahaf Aljundi, Shai Ben-David, Matthias Bethge, Andrea Cossu, Alexander Gepperth, Tyler L. Hayes, Eyke Hüllermeier, Christopher Kanan, Dhireesha Kudithipudi, Christoph H. Lampert, Martin Mundt, Razvan Pascanu, Adrian Popescu, Andreas S. Tolias, Joost van de Weijer, Bing Liu, Vincenzo Lomonaco, Tinne Tuytelaars, Gido M van de Ven

Abstract: Continual learning is a subfield of machine learning, which aims to allow machine learning models to continuously learn on new data, by accumulating knowledge without forgetting what was learned in the past. In this work, we take a step back, and ask: "Why should one care about continual learning in the first place?". We set the stage by examining recent continual learning papers published at four major machine learning conferences, and show that memory-constrained settings dominate the field. Then, we discuss five open problems in machine learning, and even though they might seem unrelated to continual learning at first sight, we show that continual learning will inevitably be part of their solution. These problems are model editing, personalization and specialization, on-device learning, faster (re-)training and reinforcement learning. Finally, by comparing the desiderata from these unsolved problems and the current assumptions in continual learning, we highlight and discuss four future directions for continual learning research. We hope that this work offers an interesting perspective on the future of continual learning, while displaying its potential value and the paths we have to pursue in order to make it successful. This work is the result of the many discussions the authors had at the Dagstuhl seminar on Deep Continual Learning, in March 2023.

URL: https://openreview.net/forum?id=axBIMcGZn9

---

Title: Anomaly detection with semi-supervised classification based on risk estimators

Authors: Le Thi Khanh Hien, Sukanya Patra, Souhaib Ben Taieb

Abstract: A significant limitation of one-class classification anomaly detection methods is their reliance on the assumption that unlabeled training data only contains normal instances. To overcome this impractical assumption, we propose two novel classification-based anomaly detection methods. Firstly, we introduce a semi-supervised shallow anomaly detection method based on an unbiased risk estimator. Secondly, we present a semi-supervised deep anomaly detection method utilizing a nonnegative (biased) risk estimator. We establish estimation error bounds and excess risk bounds for both risk minimizers. Additionally, we propose techniques to select appropriate regularization parameters that ensure the nonnegativity of the empirical risk in the shallow model under specific loss functions. Our extensive experiments provide evidence of the effectiveness of the risk-based anomaly detection methods.

URL: https://openreview.net/forum?id=ekvsBtCBUK

---

Title: Exponential Moving Average of Weights in Deep Learning: Dynamics and Benefits

Authors: Daniel Morales-Brotons, Thijs Vogels, Hadrien Hendrikx

Abstract: Weight averaging of Stochastic Gradient Descent (SGD) iterates is a popular method for training deep learning models. While it is often used as part of complex training pipelines to improve generalization or serve as a `teacher' model, weight averaging lacks proper evaluation on its own. In this work, we present a systematic study of the Exponential Moving Average (EMA) of weights. We first explore the training dynamics of EMA, give guidelines for hyperparameter tuning, and highlight its good early performance, partly explaining its success as a teacher. We also observe that EMA requires less learning rate decay compared to SGD since averaging naturally reduces noise, introducing a form of implicit regularization. Through extensive experiments, we show that EMA solutions differ from last-iterate solutions. EMA models not only generalize better but also exhibit improved i) robustness to noisy labels, ii) prediction consistency, iii) calibration and iv) transfer learning. Therefore, we suggest that an EMA of weights is a simple yet effective plug-in to improve the performance of deep learning models.
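The update studied in the paper is the standard exponential moving average of weights; a minimal PyTorch-style sketch (not the authors' code) is:

```python
import copy
import torch

@torch.no_grad()
def update_ema(ema_model, model, decay=0.999):
    """One EMA update of a separate copy of the weights:
    ema <- decay * ema + (1 - decay) * current. Call after each optimizer step."""
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)

# Typical usage (sketch): ema_model = copy.deepcopy(model); then, inside the
# training loop, call update_ema(ema_model, model) after optimizer.step().
```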

URL: https://openreview.net/forum?id=2M9CUnYnBA

---

Title: 3D Molecular Generation via Virtual Dynamics

Authors: Shuqi Lu, Lin Yao, Xi Chen, Hang Zheng, Di He, Guolin Ke

Abstract: Structure-based drug design, a critical aspect of drug discovery, aims to identify high-affinity molecules for target protein pockets. Traditional virtual screening methods, which involve exhaustive searches within large molecular databases, are inefficient and limited in discovering novel molecules. The pocket-based 3D molecular generation model offers a promising alternative by directly generating molecules with 3D structures and binding positions in the pocket. In this paper, we present VD-Gen, a novel pocket-based 3D molecular generation pipeline. VD-Gen features a series of carefully designed stages to generate fine-grained 3D molecules with binding positions in the pocket cavity end-to-end. Rather than directly generating or sampling atoms with 3D positions in the pocket, VD-Gen randomly initializes multiple virtual particles within the pocket and learns to iteratively move them to approximate the distribution of molecular atoms in 3D space. After the iterative movement, a 3D molecule is extracted and further refined through additional iterative movement, yielding a high-quality 3D molecule with a confidence score. Comprehensive experimental results on pocket-based molecular generation demonstrate that VD-Gen can generate novel 3D molecules that fill the target pocket cavity with high binding affinities, significantly outperforming previous baselines.

URL: https://openreview.net/forum?id=QvipGVdE6L

---

Title: Scalable Hierarchical Self-Attention with Learnable Hierarchy for Long-Range Interactions

Authors: Thuan Nguyen Anh Trang, Khang Nhat Ngo, Hugo Sonnery, Thieu Vo, Siamak Ravanbakhsh, Truong Son Hy

Abstract: Self-attention models have made great strides toward accurately modeling a wide array of data modalities, including, more recently, graph-structured data. This paper demonstrates that adaptive hierarchical attention can go a long way toward successfully applying transformers to graphs. Our proposed model Sequoia provides a powerful inductive bias towards long-range interaction modeling, leading to better generalization. We propose an end-to-end mechanism for a data-dependent construction of a hierarchy which in turn guides the self-attention mechanism. Using adaptive hierarchy provides a natural pathway toward sparse attention by constraining node-to-node interactions with the immediate family of each node in the hierarchy (e.g., parent, children, and siblings). This in turn dramatically reduces the computational complexity of a self-attention layer from quadratic to log-linear in terms of the input size while maintaining or sometimes even surpassing the standard transformer's ability to model long-range dependencies across the entire input. Experimentally, we report state-of-the-art performance on long-range graph benchmarks while remaining computationally efficient. Moving beyond graphs, we also display competitive performance on long-range sequence modeling, point-cloud classification, and segmentation when using a fixed hierarchy. Our source code is publicly available at https://github.com/HySonLab/HierAttention

URL: https://openreview.net/forum?id=qH4YFMyhce

---

Title: A Unified View of Differentially Private Deep Generative Modeling

Authors: Dingfan Chen, Raouf Kerkouche, Mario Fritz

Abstract: The availability of rich and vast data sources has greatly advanced machine learning applications in various domains. However, data with privacy concerns comes with stringent regulations that frequently prohibit data access and data sharing. Overcoming these obstacles in compliance with privacy considerations is key for technological progress in many real-world application scenarios that involve sensitive data. Differentially private (DP) data publishing provides a compelling solution, where only a sanitized form of the data is publicly released, enabling privacy-preserving downstream analysis and reproducible research in sensitive domains. In recent years, various approaches have been proposed for achieving privacy-preserving high-dimensional data generation by private training on top of deep neural networks. In this paper, we present a novel unified view that systematizes these approaches. Our view provides a joint design space for systematically deriving methods that cater to different use cases. We then discuss the strengths, limitations, and inherent correlations between different approaches,
aiming to shed light on crucial aspects and inspire future research.
We conclude by presenting potential paths forward for the field of DP data generation, with the aim of steering the community toward making the next important steps in advancing privacy-preserving learning.

URL: https://openreview.net/forum?id=YgmBD2c9qX

---

Title: The Cross-entropy of Piecewise Linear Probability Density Functions

Authors: Tom S. F. Haines

Abstract: The cross-entropy and its related terms from information theory (e.g. entropy, Kullback–Leibler divergence) are used throughout artificial intelligence and machine learning. This includes many of the major successes, both current and historic, where they commonly appear as the natural objective of an optimisation procedure for learning model parameters, or their distributions. This paper presents a novel derivation of the differential cross-entropy between two 1D probability density functions represented as piecewise linear functions. Implementation challenges are resolved and experimental validation is presented, including a rigorous analysis of accuracy and a demonstration of using the presented result as the objective of a neural network. Previously, cross-entropy would need to be approximated via numerical integration, or equivalent, for which calculating gradients is impractical. Machine learning models with high parameter counts are optimised primarily with gradients, so if piecewise linear density representations are to be used then the presented analytic solution is essential. This paper contributes the necessary theory for the practical optimisation of information theoretic objectives when dealing with piecewise linear distributions directly. Removing this limitation expands the design space for future algorithms.
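For orientation, the quantity derived in closed form for piecewise linear densities is the standard differential cross-entropy, with the usual relation to entropy and the Kullback–Leibler divergence:

```latex
H(p, q) \;=\; -\int p(x)\,\log q(x)\,\mathrm{d}x,
\qquad
D_{\mathrm{KL}}(p \,\|\, q) \;=\; H(p, q) - H(p).
```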

URL: https://openreview.net/forum?id=AoOi9Zgdsv

---

Title: How good is Good-Turing for Markov samples?

Authors: Prafulla Chandra, Andrew Thangaraj, Nived Rajaraman

Abstract: The Good-Turing (GT) estimator for the missing mass (i.e., total probability of missing symbols) in $n$ samples is the number of symbols that appeared exactly once divided by $n$. For i.i.d. samples, the bias and squared-error risk of the GT estimator can be shown to fall as $1/n$ by bounding the expected error uniformly over all symbols. In this work, we study convergence of the GT estimator for missing stationary mass (i.e., total stationary probability of missing symbols) of Markov samples on an alphabet $\mathcal{X}$ with stationary distribution $[\pi_x:x\in\mathcal{X}]$ and transition probability matrix (t.p.m.) $P$. This is an important and interesting problem because GT is widely used in applications with temporal dependencies such as language models assigning probabilities to word sequences, which are modelled as Markov. We show that convergence of GT depends on convergence of $(P^{\sim x})^n$, where $P^{\sim x}$ is $P$ with the $x$-th column zeroed out. This, in turn, depends on the Perron eigenvalue $\lambda^{\sim x}$ of $P^{\sim x}$ and its relationship with $\pi_x$ uniformly over $x$. For randomly generated t.p.m.s and t.p.m.s derived from New York Times and Charles Dickens corpora, we numerically exhibit such uniform-over-$x$ relationships between $\lambda^{\sim x}$ and $\pi_x$. This supports the observed success of GT in language models and practical text data scenarios. For Markov chains with rank-2, diagonalizable t.p.m.s having spectral gap $\beta$, we show minimax rate upper and lower bounds of $1/(n\beta^5)$ and $1/(n\beta)$, respectively, for the estimation of stationary missing mass. This theoretical result extends the $1/n$ minimax rate for i.i.d. or rank-1 t.p.m.s to rank-2 Markov, and is a first such minimax rate result for missing mass of Markov samples. We also show, through experiments, that the MSE of GT decays at a slower rate as the rank of the t.p.m. increases.
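The estimator itself is simple to state in code; a minimal sketch, applicable to any sequence of symbols (i.i.d. or Markov), is:

```python
from collections import Counter

def good_turing_missing_mass(samples):
    """Good-Turing estimate of the missing mass: the fraction of samples whose
    symbol appeared exactly once in the sequence."""
    counts = Counter(samples)
    n = len(samples)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / n

# Example: good_turing_missing_mass("abracadabra") counts the symbols seen once
# ('c' and 'd'), giving 2/11 as the estimated probability of unseen symbols.
```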

URL: https://openreview.net/forum?id=KokkP2nQ24

---

Title: State-wise Constrained Policy Optimization

Authors: Weiye Zhao, Rui Chen, Yifan Sun, Feihan Li, Tianhao Wei, Changliu Liu

Abstract: Reinforcement Learning (RL) algorithms have shown tremendous success in simulation environments, but their application to real-world problems faces significant challenges, with safety being a major concern. In particular, enforcing state-wise constraints is essential for many challenging tasks such as autonomous driving and robot manipulation. However, existing safe RL algorithms under the framework of Constrained Markov Decision Process (CMDP) do not consider state-wise constraints. To address this gap, we propose State-wise Constrained Policy Optimization (SCPO), the first general-purpose policy search algorithm for state-wise constrained reinforcement learning. SCPO provides guarantees for state-wise constraint satisfaction in expectation. In particular, we introduce the framework of Maximum Markov Decision Process, and prove that the worst-case safety violation is bounded under SCPO. We demonstrate the effectiveness of our approach on training neural network policies for extensive robot locomotion tasks, where the agent must satisfy a variety of state-wise safety constraints. Our results show that SCPO significantly outperforms existing methods and can handle state-wise constraints in high-dimensional robotics tasks.

URL: https://openreview.net/forum?id=NgK5etmhz9

---

Title: Provable Membership Inference Privacy

Authors: Zachary Izzo, Jinsung Yoon, Sercan O Arik, James Zou

Abstract: In applications involving sensitive data, such as finance and healthcare, the necessity for preserving data privacy can be a significant barrier to machine learning model development. Differential privacy (DP) has emerged as one canonical standard for provable privacy. However, DP’s strong theoretical guarantees often come at the cost of a large drop in its utility for machine learning, and DP guarantees themselves are difficult to interpret. In this work, we propose a novel privacy notion, membership inference privacy (MIP), as a step towards addressing these challenges. We give a precise characterization of the relationship between MIP and DP, and show that in some cases, MIP can be achieved using less randomness than is required for guaranteeing DP, leading to a smaller drop in utility. MIP guarantees are also easily interpretable in terms of the success rate of membership inference attacks in a simple random subsampling setting. As a proof of concept, we also provide a simple algorithm for guaranteeing MIP without needing to guarantee DP.

URL: https://openreview.net/forum?id=3ludyxPbb6

---

Title: Adaptive Conformal Regression with Split-Jackknife+ Scores

Authors: Nicolas Deutschmann, Mattia Rigotti, Maria Rodriguez Martinez

Abstract: We introduce an extension of conformal prediction (CP), based on a combination of split-CP and the Jackknife+ procedure, that enables tuning score functions on calibration data and is designed to produce dynamically sized prediction intervals in regression settings.
We motivate this method with theoretical results on distribution-dependent conditional coverage guarantees for split-CP and Jackknife+ prediction sets which are determined by the statistical dependence between input data and prediction scores.
This dependence can be reduced by adapting the score function to the data distribution, thereby improving the conditional validity of conformal prediction sets.
As an illustration, we construct a variant of the MADSplit conformal regression procedure where conditional mean estimates are computed in-distribution and show through empirical validation that our method is more robust to overfitting effects than the original method, while being more sample-efficient than modern ECDF-based methods.
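For context, a plain split conformal regression interval built from calibration scores looks as follows; this is only the baseline that the proposed Split-Jackknife+ score tuning extends, not the method itself.

```python
import numpy as np

def split_conformal_interval(cal_scores, x_pred, predict, alpha=0.1):
    """Plain split conformal regression interval.
    `cal_scores` are absolute residuals |y - f(x)| on a held-out calibration set,
    `predict` is the fitted regression function, `alpha` the miscoverage level."""
    n = len(cal_scores)
    # Finite-sample-corrected (1 - alpha) quantile of the calibration scores.
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(cal_scores, q_level)
    y_hat = predict(x_pred)
    return y_hat - q, y_hat + q
```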

URL: https://openreview.net/forum?id=1fbTGC3BUD

---

Title: Neural networks can be FLOP-efficient integrators of 1D oscillatory integrands

Authors: Anshuman Sinha, Spencer H Bryngelson

Abstract: We demonstrate that neural networks can be FLOP-efficient integrators of one-dimensional oscillatory integrands. We train a feed-forward neural network to compute integrals of highly oscillatory 1D functions. The training set is a parametric combination of functions with varying character and degrees of oscillatory behavior. Numerical examples show that these networks are FLOP-efficient for sufficiently oscillatory integrands, with an average gain of $10^{3}$ FLOPs. The network calculates oscillatory integrals better than traditional quadrature methods under the same computational budget or number of floating point operations. We find that feed-forward networks with 5 hidden layers are satisfactory for a relative accuracy of $10^{-3}$. The computational burden of inference of the neural network is relatively small, even compared to inner-product pattern quadrature rules. We postulate that our result follows from learning latent patterns in the oscillatory integrands that are otherwise opaque to traditional numerical integrators.

URL: https://openreview.net/forum?id=5psgQEHn6t

---

Title: Towards generalizing deep-audio fake detection networks

Authors: Konstantin Gasenzer, Moritz Wolter

Abstract: Today's generative neural networks allow the creation of high-quality synthetic speech at scale. While we welcome the creative use of this new technology, we must also recognize the risks. As synthetic speech is abused for monetary and identity theft, we require a broad set of deepfake identification tools. Furthermore, previous work reported a limited ability of deep classifiers to generalize to unseen audio generators. We study the frequency domain fingerprints of current audio generators. Building on top of the discovered frequency footprints, we train excellent lightweight detectors that generalize. We report improved results on the WaveFake dataset and an extended version. To account for the rapid progress in the field, we extend the WaveFake dataset by additionally considering samples drawn from the novel Avocodo and BigVGAN networks. For illustration purposes, the supplementary material contains audio samples of generator artifacts.

URL: https://openreview.net/forum?id=RGewtLtvHz

---

Title: Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies

Authors: Zichao Li, Cihang Xie, Ekin Dogus Cubuk

Abstract: This paper investigates the performance of the Contrastive Language-Image Pre-training (CLIP) when scaled down to limited computation budgets. We explore CLIP along three dimensions: data, architecture, and training strategies. With regards to data, we demonstrate the significance of high-quality training data and show that a smaller dataset of high-quality data can outperform a larger dataset with lower quality. We also examine how model performance varies with different dataset sizes, suggesting that smaller ViT models are better suited for smaller datasets, while larger models perform better on larger datasets with fixed compute. Additionally, we provide guidance on when to choose a CNN-based architecture or a ViT-based architecture for CLIP training. We compare four CLIP training strategies - SLIP, FLIP, CLIP, and CLIP+Data Augmentation - and show that the choice of training strategy depends on the available compute resource. Our analysis reveals that CLIP+Data Augmentation can achieve comparable performance to CLIP using only half of the training data. This work provides practical insights into how to effectively train and deploy CLIP models, making them more accessible and affordable for practical use in various applications.

URL: https://openreview.net/forum?id=t4nnCi5AO6

---

Title: Low-Rank Tensor-Network Encodings for Video-to-Action Behavioral Cloning

Authors: Brian Chen, Doruk Aksoy, David J Gorsich, Shravan Veerapaneni, Alex Gorodetsky

Abstract: We describe a tensor-network latent-space encoding approach for increasing the scalability of behavioral cloning of a video game player’s actions entirely from video streams of the gameplay. Specifically, we address challenges associated with the high computational requirements of traditional deep-learning based encoders such as convolutional variational autoencoders that prohibit their use in widely available hardware or for large scale data. Our approach uses tensor networks instead of deep variational autoencoders for this purpose, and it yields significant speedups with no loss of accuracy. Empirical results on ATARI games demonstrate that our approach leads to a speedup in the time it takes to encode data and train a predictor using the encodings (between 2.6× and 9.6× compared to autoencoders or variational autoencoders). Furthermore, the tensor train encoding can be efficiently trained on CPU as well, which leads to comparable or better training times than the autoencoder and variational autoencoder trained on GPU (0.9× to 5.4× faster). These results suggest significant possibilities in mitigating the need for cost and time-intensive hardware for training deep-learning architectures for behavioral cloning.

URL: https://openreview.net/forum?id=w4DXLzBPPw

---

Title: EHRDiff : Exploring Realistic EHR Synthesis with Diffusion Models

Authors: Hongyi Yuan, Songchi Zhou, Sheng Yu

Abstract: Electronic health records (EHR) contain a wealth of biomedical information, serving as valuable resources for the development of precision medicine systems. However, privacy concerns have resulted in limited access to high-quality and large-scale EHR data for researchers, impeding progress in methodological development. Recent research has delved into synthesizing realistic EHR data through generative modeling techniques, where a majority of proposed methods relied on generative adversarial networks (GAN) and their variants for EHR synthesis. Despite GAN-based methods attaining state-of-the-art performance in generating EHR data, these approaches are difficult to train and prone to mode collapse. Recently introduced in generative modeling, diffusion models have established cutting-edge performance in image generation, but their efficacy in EHR data synthesis remains largely unexplored. In this study, we investigate the potential of diffusion models for EHR data synthesis and introduce a novel method, EHRDiff. Through extensive experiments, EHRDiff establishes new state-of-the-art quality for synthetic EHR data while protecting private information.

URL: https://openreview.net/forum?id=DIGkJhGeqi

---

Title: Understanding Fairness Surrogate Functions in Algorithmic Fairness

Authors: Wei Yao, Zhanke Zhou, Zhicong Li, Bo Han, Yong Liu

Abstract: It has been observed that machine learning algorithms exhibit biased predictions against certain population groups. To mitigate such bias while achieving comparable accuracy, a promising approach is to introduce surrogate functions of the concerned fairness definition and solve a constrained optimization problem. However, previous work has made the intriguing observation that such fairness surrogate functions may yield unfair results and high instability. In this work, in order to understand them more deeply, we take a widely used fairness definition, demographic parity, as an example and show that there is a surrogate-fairness gap between the fairness definition and the fairness surrogate function. Moreover, our theoretical analysis and experimental results about the "gap" suggest that fairness and stability are affected by points far from the decision boundary, the large-margin-points issue investigated in this paper. To address it, we propose the general sigmoid surrogate to simultaneously reduce both the surrogate-fairness gap and the variance, and offer a rigorous fairness and stability upper bound. Interestingly, the theory also provides insight into two important issues: dealing with the large margin points and obtaining a more balanced dataset are both beneficial to fairness and stability. Furthermore, we develop a novel and general algorithm called Balanced Surrogate, which iteratively reduces the "gap" to mitigate unfairness. Finally, we provide empirical evidence showing that our methods consistently improve fairness and stability while maintaining accuracy comparable to the baselines on three real-world datasets.

URL: https://openreview.net/forum?id=iBgmoMTlaz

---

Title: Finite-Time Analysis of Entropy-Regularized Neural Natural Actor-Critic Algorithm

Authors: Semih Cayci, Niao He, R. Srikant

Abstract: Natural actor-critic (NAC) and its variants, equipped with the representation power of neural networks, have demonstrated impressive empirical success in solving Markov decision problems with large (potentially infinite) state spaces. In this paper, we present a finite-time analysis of NAC with neural network approximation, and identify the roles of neural networks, regularization and optimization techniques (e.g., gradient clipping and weight decay) to achieve provably good performance in terms of sample complexity, iteration complexity and overparametrization bounds for the actor and the critic. In particular, we prove that (i) entropy regularization and weight decay ensure stability by providing sufficient exploration to avoid near-deterministic and strictly suboptimal policies and (ii) regularization leads to sharp sample complexity and network width bounds in the regularized MDPs, yielding a favorable bias-variance tradeoff in policy optimization. In the process, we identify the importance of uniform approximation power of the actor neural network to achieve global optimality in policy optimization due to distributional shift.

URL: https://openreview.net/forum?id=BkEqk7pS1I

---

Title: PopulAtion Parameter Averaging (PAPA)

Authors: Alexia Jolicoeur-Martineau, Emy Gervais, Kilian FATRAS, Yan Zhang, Simon Lacoste-Julien

Abstract: Ensemble methods combine the predictions of multiple models to improve performance, but they require significantly higher computation costs at inference time. To avoid these costs, multiple neural networks can be combined into one by averaging their weights. However, this usually performs significantly worse than ensembling. Weight averaging is only beneficial when the networks are different enough to benefit from combining them, yet similar enough to average well. Based on this idea, we propose PopulAtion Parameter Averaging (PAPA): a method that combines the generality of ensembling with the efficiency of weight averaging. PAPA leverages a population of diverse models (trained on different data orders, augmentations, and regularizations) while slowly pushing the weights of the networks toward the population average of the weights. We also propose PAPA variants (PAPA-all, and PAPA-2) that average weights rarely rather than continuously; all methods increase generalization, but PAPA tends to perform best. PAPA reduces the performance gap between averaging and ensembling, increasing the average accuracy of a population of models by up to 0.8% on CIFAR-10, 1.9% on CIFAR-100, and 1.6% on ImageNet when compared to training independent (non-averaged) models.
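
A minimal sketch of the core averaging step, assuming a small constant push rate; the paper's scheduling, buffer handling, and the rarely-averaging PAPA-all/PAPA-2 variants are not reproduced here.

```python
import torch

@torch.no_grad()
def papa_push(models, rate=0.01):
    """Slowly push every network's weights toward the population average
    (push rate is illustrative; buffers and schedules are omitted)."""
    params = [list(m.parameters()) for m in models]
    for group in zip(*params):                       # same parameter across all models
        avg = torch.stack([p.data for p in group]).mean(dim=0)
        for p in group:
            p.data.mul_(1 - rate).add_(rate * avg)   # w_i <- (1 - rate) * w_i + rate * mean(w)
```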

URL: https://openreview.net/forum?id=cPDVjsOytS

---

Title: The Missing U for Efficient Diffusion Models

Authors: Sergio Calvo Ordoñez, Chun-Wun Cheng, Jiahao Huang, Lipei Zhang, Guang Yang, Carola-Bibiane Schönlieb, Angelica I Aviles-Rivero

Abstract: Diffusion Probabilistic Models stand as a critical tool in generative modelling, enabling the generation of complex data distributions. This family of generative models yields record-breaking performance in tasks such as image synthesis, video generation, and molecule design. Despite their capabilities, their efficiency, especially in the reverse process, remains a challenge due to slow convergence rates and high computational costs. In this paper, we introduce an approach that leverages continuous dynamical systems to design a novel denoising network for diffusion models that is more parameter-efficient, exhibits faster convergence, and demonstrates increased noise robustness. Experimenting with Denoising Diffusion Probabilistic Models (DDPMs), our framework operates with approximately a quarter of the parameters, and $\sim$ 30\% of the Floating Point Operations (FLOPs) compared to standard U-Nets in DDPMs. Furthermore, our model is notably faster in inference than the baseline when measured in fair and equal conditions. We also provide a mathematical intuition as to why our proposed reverse process is faster as well as a mathematical discussion of the empirical tradeoffs in the denoising downstream task. Finally, we argue that our method is compatible with existing performance enhancement techniques, enabling further improvements in efficiency, quality, and speed.

URL: https://openreview.net/forum?id=Y4YWzBiTEV

---

Title: CoDeC: Communication-Efficient Decentralized Continual Learning

Authors: Sakshi Choudhary, Sai Aparna Aketi, Gobinda Saha, Kaushik Roy

Abstract: Training at the edge utilizes continuously evolving data generated at different locations. Privacy concerns prohibit the co-location of this spatially and temporally distributed data, making it crucial to design training algorithms that enable efficient continual learning over decentralized private data. Decentralized learning allows serverless training with spatially distributed data. A fundamental barrier in such setups is the high bandwidth cost of communicating model updates between agents. Moreover, existing works under this training paradigm are not inherently suitable for learning a temporal sequence of tasks while retaining the previously acquired knowledge. In this work, we propose CoDeC, a novel communication-efficient decentralized continual learning algorithm that addresses these challenges. We mitigate catastrophic forgetting while learning a distributed task sequence by incorporating orthogonal gradient projection within a gossip-based decentralized learning algorithm. Further, CoDeC includes a novel lossless communication compression scheme based on the gradient subspaces. We theoretically analyze the convergence rate for our algorithm and demonstrate through an extensive set of experiments that CoDeC successfully learns distributed continual tasks with minimal forgetting. The proposed compression scheme results in up to 4.8× reduction in communication costs without any loss in performance.
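
A minimal sketch of the orthogonal-gradient-projection ingredient, assuming an orthonormal basis for the previously learned gradient subspace is already available; the gossip averaging and the subspace-based compression scheme are not shown.

```python
import torch

def project_orthogonal(grad, basis):
    """Remove the component of `grad` lying in the span of `basis` columns,
    so updates for the new task do not interfere with previously learned tasks.
    basis: (d, k) matrix with orthonormal columns, or None for the first task."""
    if basis is None:
        return grad
    g = grad.flatten()
    g = g - basis @ (basis.T @ g)   # project onto the orthogonal complement
    return g.view_as(grad)
```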

URL: https://openreview.net/forum?id=N05OnQG1BA

---

Title: Bias Amplification Enhances Minority Group Performance

Authors: Gaotang Li, Jiarui Liu, Wei Hu

Abstract: Neural networks produced by standard training are known to suffer from poor accuracy on rare subgroups despite achieving high accuracy on average, due to the correlations between certain spurious features and labels. Previous approaches based on worst-group loss minimization (e.g. Group-DRO) are effective in improving worst-group accuracy but require expensive group annotations for all the training samples. In this paper, we focus on the more challenging and realistic setting where group annotations are only available on a small validation set or are not available at all. We propose BAM, a novel two-stage training algorithm: in the first stage, the model is trained using a bias amplification scheme by introducing a learnable auxiliary variable for each training sample; in the second stage, we upweight the samples that the bias-amplified model misclassifies, and then continue training the same model on the reweighted dataset. Empirically, BAM achieves competitive performance compared with existing methods evaluated on spurious correlation benchmarks in computer vision and natural language processing. Moreover, we find a simple stopping criterion based on the minimum class accuracy difference that can remove the need for group annotations, with little or no loss in worst-group accuracy. We perform extensive analyses and ablations to verify the effectiveness and robustness of our algorithm under varying class and group imbalance ratios.
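
A minimal sketch of the second-stage reweighting, assuming the stage-one bias-amplified model's predictions are given; the auxiliary-variable scheme of stage one and the upweighting factor below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def stage_two_weights(bias_amplified_preds, labels, upweight=5.0):
    """Upweight the samples that the bias-amplified model misclassifies,
    then continue training on the reweighted dataset (factor is illustrative)."""
    errors = np.asarray(bias_amplified_preds) != np.asarray(labels)
    weights = np.ones(len(labels))
    weights[errors] = upweight
    return weights
```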

URL: https://openreview.net/forum?id=75OwvzZZBT

---

Title: Fast and Expressive Gesture Recognition using a Combination-Homomorphic Electromyogram Encoder

Authors: Niklas Smedemark-Margulies, Yunus Bicer, Elifnur Sunger, Tales Imbiriba, Eugene Tunik, Deniz Erdogmus, Mathew Yarossi, Robin Walters

Abstract: We study the task of gesture recognition from electromyography (EMG), with the goal of enabling expressive human-computer interaction at high accuracy, while minimizing the time required for new subjects to provide calibration data.
To fulfill these goals, we define combination gestures consisting of a direction component and a modifier component.
New subjects only demonstrate the single component gestures and we seek to extrapolate from these to all possible single or combination gestures.
We extrapolate to unseen combination gestures by combining the feature vectors of real single gestures to produce synthetic training data.
This strategy allows us to provide a large and flexible gesture vocabulary, while not requiring new subjects to demonstrate combinatorially many example gestures.
We pre-train an encoder and a combination operator using self-supervision, so that we can produce useful synthetic training data for unseen test subjects.
To evaluate the proposed method, we collect a real-world EMG dataset, and measure the effect of augmented supervision against two baselines: a partially-supervised model trained with only single gesture data from the unseen subject, and a fully-supervised model trained with real single and real combination gesture data from the unseen subject.
We find that the proposed method provides a dramatic improvement over the partially-supervised model, and achieves a useful classification accuracy that in some cases approaches the performance of the fully-supervised model.
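
As a rough illustration of how synthetic combination-gesture features could be built from single-gesture features, here is a toy stand-in; the paper's encoder and learned combination operator are more involved, and the linear mixer below is hypothetical.

```python
import torch
import torch.nn as nn

class CombinationOperator(nn.Module):
    """Toy stand-in for a learned combination operator: map the features of a
    real direction gesture and a real modifier gesture to a synthetic feature
    vector for the (unseen) combination gesture."""
    def __init__(self, dim):
        super().__init__()
        self.mix = nn.Linear(2 * dim, dim)

    def forward(self, z_direction, z_modifier):
        return self.mix(torch.cat([z_direction, z_modifier], dim=-1))
```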

URL: https://openreview.net/forum?id=j5T4pcLbcY

---

Title: Optimization with Access to Auxiliary Information

Authors: El Mahdi Chayti, Sai Praneeth Karimireddy

Abstract: We investigate the fundamental optimization question of minimizing a \emph{target} function $f(x)$, whose gradients are expensive to compute or have limited availability, given access to some \emph{auxiliary} side function $h(x)$ whose gradients are cheap or more available. This formulation captures many settings of practical relevance, such as i) re-using batches in SGD, ii) transfer learning, iii) federated learning, iv) training with compressed models/dropout, etc. We propose two generic new algorithms that apply in all these settings, and we prove that this framework is beneficial under a Hessian similarity assumption between the target and the side information: a benefit is obtained when this similarity measure is small. We also show a potential benefit from stochasticity when the auxiliary noise is correlated with that of the target function.
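
One natural instantiation of the idea (not necessarily the paper's algorithms): take steps with the cheap auxiliary gradient and correct its bias at an anchor point where the expensive target gradient was last evaluated.

```python
import numpy as np

def aux_corrected_step(x, anchor, grad_f, grad_h, lr=0.1):
    """One gradient step using the cheap auxiliary gradient grad_h, with a bias
    correction computed at an anchor point where the expensive target gradient
    grad_f was last evaluated. Step size and scheme are illustrative."""
    g = grad_h(x) - grad_h(anchor) + grad_f(anchor)
    return x - lr * g
```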

URL: https://openreview.net/forum?id=kxYqgSkH8I

---

Title: Navigating Noise: A Study of How Noise Influences Generalisation and Calibration of Neural Networks

Authors: Martin Ferianc, Ondrej Bohdal, Timothy Hospedales, Miguel R. D. Rodrigues

Abstract: Enhancing the generalisation abilities of neural networks (NNs) through integrating noise such as MixUp or Dropout during training has emerged as a powerful and adaptable technique. Despite the proven efficacy of noise in NN training, there is no consensus regarding which noise sources, types and placements yield maximal benefits in generalisation and confidence calibration. This study thoroughly explores diverse noise modalities to evaluate their impacts on NN's generalisation and calibration under in-distribution or out-of-distribution settings, paired with experiments investigating the metric landscapes of the learnt representations, across a spectrum of NN architectures, tasks, and datasets. Our study shows that AugMix and weak augmentation exhibit cross-task effectiveness in computer vision, emphasising the need to tailor noise to specific domains. Our findings emphasise the efficacy of combining noises and successful hyperparameter transfer within a single domain but the difficulties in transferring the benefits to other domains. Furthermore, the study underscores the complexity of simultaneously optimising for both generalisation and calibration, emphasising the need for practitioners to carefully consider noise combinations and hyperparameter tuning for optimal performance in specific tasks and datasets.

URL: https://openreview.net/forum?id=zn3fB4VVF0

---

Title: On the Robustness of Neural Collapse and the Neural Collapse of Robustness

Authors: Jingtong Su, Ya Shi Zhang, Nikolaos Tsilivis, Julia Kempe

Abstract: Neural Collapse refers to the curious phenomenon at the end of training of a neural network, where feature vectors and classification weights converge to a very simple geometrical arrangement (a simplex). While it has been observed empirically in various cases and has been theoretically motivated, its connection with crucial properties of neural networks, like their generalization and robustness, remains unclear. In this work, we study the stability properties of these simplices.
We find that the simplex structure disappears under small adversarial attacks, and that perturbed examples "leap" between simplex vertices.
We further analyze the geometry of networks that are optimized to be robust against adversarial perturbations of the input, and find that Neural Collapse is a pervasive phenomenon in these cases as well, with clean and perturbed representations forming aligned simplices and giving rise to a robust simple nearest-neighbor classifier. By studying the propagation of the amount of collapse inside the network, we identify novel properties of both robust and non-robust machine learning models, and show that earlier layers, unlike later ones, maintain reliable simplices on perturbed data.
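
A crude probe of the simplex structure, not the paper's exact measurement: the centered, normalized class means of a simplex equiangular tight frame have pairwise cosines of -1/(C-1), so large deviations on adversarially perturbed features indicate that the collapse has broken down.

```python
import numpy as np

def simplex_etf_deviation(features, labels):
    """Return the maximum deviation of pairwise cosines of centered class means
    from the ideal simplex value -1/(C-1). features: (n, d), labels: (n,)."""
    classes = np.unique(labels)
    means = np.stack([features[labels == c].mean(axis=0) for c in classes])
    means = means - means.mean(axis=0)                       # center the class means
    means = means / np.linalg.norm(means, axis=1, keepdims=True)
    cos = means @ means.T
    target = -1.0 / (len(classes) - 1)
    off_diag = cos[~np.eye(len(classes), dtype=bool)]
    return np.abs(off_diag - target).max()
```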

URL: https://openreview.net/forum?id=OyXS4ZIqd3

---

Title: Discrete Graph Auto-Encoder

Authors: Yoann Boget, Magda Gregorova, Alexandros Kalousis

Abstract: Despite advances in generative methods, accurately modeling the distribution of graphs remains a challenging task primarily because of the absence of predefined or inherent unique graph representation.
Two main strategies have emerged to tackle this issue: 1) restricting the number of possible representations by sorting the nodes, or 2) using permutation-invariant/equivariant functions, specifically Graph Neural Networks (GNNs).

In this paper, we introduce a new framework named Discrete Graph Auto-Encoder (DGAE), which leverages the strengths of both strategies and mitigates their respective limitations. In essence, we propose a two-step strategy. We first use a permutation-equivariant auto-encoder to convert graphs into sets of discrete latent node representations, each node being represented by a sequence of quantized vectors. In the second step, we sort the sets of discrete latent representations and learn their distribution with a specifically designed auto-regressive model based on the Transformer architecture.

Through multiple experimental evaluations, we demonstrate the competitive performance of our model in comparison to the existing state of the art across various datasets. Various ablation studies support the merits of our method.

URL: https://openreview.net/forum?id=bZ80b0wb9d

---

Title: Indexed Minimum Empirical Divergence-Based Algorithms for Linear Bandits

Authors: Jie Bian, Vincent Y. F. Tan

Abstract: The Indexed Minimum Empirical Divergence (IMED) algorithm is a highly effective approach that offers a stronger theoretical guarantee of the asymptotic optimality compared to the Kullback--Leibler Upper Confidence Bound (KL-UCB) algorithm for the multi-armed bandit problem. Additionally, it has been observed to empirically outperform UCB-based algorithms and Thompson Sampling. Despite its effectiveness, the generalization of this algorithm to contextual bandits with linear payoffs has remained elusive. In this paper, we present novel linear versions of the IMED algorithm, which we call the family of LinIMED algorithms. We demonstrate that LinIMED provides a $\widetilde{O}(d\sqrt{T})$ upper regret bound where $d$ is the dimension of the context and $T$ is the time horizon. Furthermore, extensive empirical studies reveal that LinIMED and its variants outperform widely-used linear bandit algorithms such as LinUCB and Linear Thompson Sampling in some regimes.

URL: https://openreview.net/forum?id=wE9kpJSemv

---

Title: ModuLoRA: Finetuning 2-Bit LLMs on Consumer GPUs by Integrating with Modular Quantizers

Authors: Junjie Yin, Jiahao Dong, Yingheng Wang, Christopher De Sa, Volodymyr Kuleshov

Abstract: We propose a memory-efficient finetuning algorithm for large language models (LLMs) that supports finetuning LLMs with 65B parameters in 2/3/4-bit precision on as little as one 24GB GPU. Our method, modular low-rank adaptation (ModuLoRA), integrates any user-specified weight quantizer with finetuning via low-rank adapters (LoRAs). Our approach relies on a simple quantization-agnostic backward pass that adaptively materializes low-precision LLM weights from a custom black-box quantization module. This approach enables finetuning 2-bit and 3-bit LLMs for the first time---leveraging state-of-the-art 2-bit QuIP# quantization and 3-bit OPTQ quantization---outperforming finetuning that relies on less sophisticated 4-bit and 8-bit methods. In our experiments, ModuLoRA attains competitive performance on text classification, natural language inference, and instruction following tasks using significantly less memory than existing approaches, and we also surpass the state-of-the-art ROUGE score on a popular summarization task. We release ModuLoRA together with a series of low-precision models as part of LLMTOOLS, a user-friendly library for quantizing, running, and finetuning LLMs on consumer GPUs.
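
A rough sketch of the idea behind the forward pass, not LLMTOOLS's actual API: frozen weights stay in a black-box quantized form, are materialized only when needed, and gradients flow solely through the LoRA factors. The dequantize callable and the shapes below are hypothetical placeholders.

```python
import torch
import torch.nn as nn

class ModularQuantizedLinear(nn.Module):
    """Sketch: a frozen, black-box-quantized weight plus a trainable LoRA adapter.
    `dequantize` is a stand-in for any user-specified quantizer's decode routine."""
    def __init__(self, quantized_weight, dequantize, in_dim, out_dim, rank=16):
        super().__init__()
        self.qweight = quantized_weight          # opaque quantized storage, no gradient
        self.dequantize = dequantize             # callable: qweight -> (out_dim, in_dim) tensor
        self.lora_a = nn.Parameter(torch.randn(rank, in_dim) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_dim, rank))

    def forward(self, x):
        w = self.dequantize(self.qweight)        # materialize low-precision weights on the fly
        return x @ w.T + x @ self.lora_a.T @ self.lora_b.T
```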

URL: https://openreview.net/forum?id=r9p9CV52MV

---


New submissions
===============


Title: RandAlign: A Parameter-Free Method for Regularizing Graph Convolutional Networks

Abstract: Studies continually find that message-passing graph convolutional networks suffer from the over-smoothing issue. Basically, over-smoothing refers to the phenomenon that the learned embeddings for all nodes become very similar to one another, and therefore uninformative, after repeatedly applying message-passing iterations. Intuitively, we can expect the generated embeddings to become smoother layer by layer, that is, each layer of graph convolution generates a smoothed version of the embeddings compared to those generated by the previous layer. Based on this intuition, we propose RandAlign, a stochastic regularization method for graph convolutional networks. The idea of RandAlign is to randomly align the learned embedding for each node with that generated by the previous layer using random interpolation in each graph convolution layer. Through alignment, the smoothness of the generated embeddings is explicitly reduced. To better maintain the benefit yielded by the graph convolution, in the alignment step we first scale the embedding of the previous layer to the same norm as the generated embedding and then perform random interpolation for aligning the generated embedding. RandAlign is a parameter-free method and can be directly applied without introducing additional trainable weights or hyper-parameters. We experimentally evaluate RandAlign on different graph-domain tasks on seven benchmark datasets. The experimental results show that RandAlign is a generic method that improves the generalization performance of various graph convolutional network models and also improves the numerical stability of optimization, advancing the state-of-the-art performance for graph representation learning.
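
A minimal sketch of the alignment step as described above; the per-node uniform interpolation weight is an assumption, not necessarily the distribution used in the paper.

```python
import torch

def rand_align(h_curr, h_prev):
    """Scale the previous layer's embeddings to the norm of the current ones,
    then randomly interpolate per node. h_curr, h_prev: (num_nodes, dim)."""
    scale = h_curr.norm(dim=-1, keepdim=True) / (h_prev.norm(dim=-1, keepdim=True) + 1e-12)
    h_prev_scaled = h_prev * scale
    lam = torch.rand(h_curr.size(0), 1, device=h_curr.device)  # per-node random weight
    return lam * h_curr + (1 - lam) * h_prev_scaled
```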

URL: https://openreview.net/forum?id=cEFLfM8iyj

---

Title: GANDALF: Gated Adaptive Network for Deep Automated Learning of Features for Tabular Data

Abstract: We propose GANDALF (Gated Adaptive Network for Deep Automated Learning of Features), a novel high-performance, interpretable, and parameter- and computation-efficient deep learning architecture for tabular data. GANDALF relies on a new tabular processing unit with a gating mechanism and in-built feature selection, the Gated Feature Learning Unit (GFLU), as its feature representation learning unit. We demonstrate that GANDALF outperforms or stays at par with SOTA approaches such as XGBoost, SAINT, and FT-Transformers in experiments on multiple established public benchmarks. We have made the code available at https://github.com/manujosephv/pytorch\_tabular under the MIT License.

URL: https://openreview.net/forum?id=OE3PPhvMXQ

---

Title: Interpreting CLIP: Insights on the Robustness to ImageNet Distribution Shifts

Abstract: What distinguishes robust models from non-robust ones? While for ImageNet distribution shifts it has been shown that such differences in robustness can be traced back predominantly to differences in training data, so far it is not known what that translates to in terms of what the model has learned. In this work, we bridge this gap by probing the representation spaces of 16 robust CLIP vision encoders with various backbones (ResNets and ViTs) and pretraining sets (OpenAI, LAION-400M, LAION-2B, YFCC15M, CC12M and DataComp), and comparing them to the representation spaces of less robust models with identical backbones, but different (pre)training sets or objectives (CLIP pretraining on ImageNet-Captions, and supervised training or finetuning on ImageNet).
Through this analysis, we generate three novel insights.
Firstly, we detect the presence of outlier features in the robust zero-shot CLIP vision encoders, which to the best of our knowledge is the first time these are observed in non-language and non-transformer models.
Secondly, we find the existence of outlier features to be a signature of ImageNet shift robustness in models, since we only find them in robust models in our analysis.
Lastly, we also investigate the number of unique encoded concepts in the representation space and find zero-shot CLIP models to encode a higher number of unique concepts in their representation space. However, we find this to be rather a signature of language supervision than a signature of ImageNet shift robustness.

URL: https://openreview.net/forum?id=1SCptTFtmV

---

Title: Neural Clamping: Joint Input Perturbation and Temperature Scaling for Neural Network Calibration

Abstract: Neural network calibration is an essential task in deep learning to ensure consistency between the confidence of model prediction and the true correctness likelihood. In this paper, we propose a new post-processing calibration method called $\textbf{Neural Clamping}$, which employs a simple joint input-output transformation on a pre-trained classifier via a learnable universal input perturbation and an output temperature scaling parameter. Moreover, we provide theoretical explanations on why Neural Clamping is provably better than temperature scaling. Evaluated on BloodMNIST, CIFAR-100, and ImageNet image recognition datasets and a variety of deep neural network models, our empirical results show that Neural Clamping significantly outperforms state-of-the-art post-processing calibration methods. The code is available at anonymous.4open.science/r/NCToolkit.
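
A minimal sketch of the joint input-output transformation, assuming the perturbation and temperature are fit post hoc on held-out data with a standard calibration objective such as NLL; the paper's exact objective and initialization are not reproduced.

```python
import torch
import torch.nn as nn

class NeuralClamping(nn.Module):
    """Post-hoc calibration wrapper: a learnable universal input perturbation
    plus a temperature on the logits of a frozen pre-trained classifier."""
    def __init__(self, classifier, input_shape):
        super().__init__()
        self.classifier = classifier                     # frozen pre-trained model
        self.delta = nn.Parameter(torch.zeros(*input_shape))   # e.g. (3, 32, 32)
        self.log_temp = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        logits = self.classifier(x + self.delta)         # universal input perturbation
        return logits / self.log_temp.exp()              # temperature scaling
```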

URL: https://openreview.net/forum?id=qSFToMqLcq

---

Title: VQ-learning: Towards Unbiased Action Value Estimation in Reinforcement Learning

Abstract: Q-learning, a well-known reinforcement learning algorithm, is prone to overestimation of action values in stochastic settings. Such overestimation is mainly due to the use of the max operator when updating the Q function. Deep Q-learning (DQN) suffers from the same problem, which is further aggravated by a noisy learning environment and can lead to substantial degradation of reward performance. In this work, we introduce a simple yet effective method called VQ-learning, along with the extended version using function approximation, called Deep VQ-Networks (DVQN), which regulates the estimation of action values and effectively tackles the issue of biased value estimation. While Double Q-learning has been proposed to tackle the same issue, we show that VQ-learning provides better sample efficiency, even when the overestimation bias preconditions are eliminated. We also evaluate DVQN on the Atari-100k benchmark and demonstrate that DVQN consistently outperforms Deep Q-learning, Deep Double Q-learning, Clipped Deep Double Q-learning, Averaged DQN and Dueling Deep Q-learning in terms of reward performance and sample efficiency. Moreover, our experimental results show that DVQN serves as a better backbone network than DQN when combined with an additional representation learning objective.

URL: https://openreview.net/forum?id=kDhPx1k4fb

---

Title: Federated Graph Learning with Graphless Clients

Abstract: Federated graph learning is tasked with training machine learning models, such as Graph Neural Networks (GNNs), for multiple clients, each with its own graph data. Existing methods usually assume that each client has both the node features and the graph structure of its graph data. In real-world scenarios, however, there exist federated learning systems where only some of the clients have such data, while the remaining clients, referred to as graphless clients, may only have node features. This naturally leads to a novel problem in federated graph learning: how to jointly train a model over distributed graph data with graphless clients? To tackle this problem, we propose a novel Federated Graph Structure Learning (FedGSL) framework in this paper. In FedGSL, we devise a local graph learner on each graphless client which learns the local graph structure with the structure knowledge transferred from other clients. To enable structure knowledge transfer, we design a GNN model and a feature encoder on each client. During local training, the feature encoder retains the local graph structure knowledge together with the GNN model via knowledge distillation, and the structure knowledge is transferred among clients in the global update. Our extensive experiments on five real-world graph datasets demonstrate the superiority of FedGSL over five other federated learning approaches.

URL: https://openreview.net/forum?id=mVAp0eDfyR

---

Title: Learning $k$-Level Structured Sparse Neural Networks Using Group Envelope Regularization

Abstract: The extensive need for computational resources poses a significant obstacle to deploying large-scale Deep Neural Networks (DNN) on devices with constrained resources. At the same time, studies have demonstrated that a significant number of these DNN parameters are redundant and extraneous. In this paper, we introduce a novel approach for learning structured sparse neural networks, aimed at addressing the challenges of DNN hardware deployment. We develop a novel regularization technique, termed Weighted Group Sparse Envelope Function (WGSEF), generalizing the Sparse Envelope Function (SEF), to select (or nullify) neuron groups, thereby reducing redundancy and enhancing computational efficiency. The method speeds up inference time and aims to reduce memory demand and power consumption, thanks to its adaptability, which lets any hardware specify group definitions, such as filters, channels, filter shapes, layer depths, or a single parameter (unstructured). The properties of the WGSEF allow one to pre-define the desired sparsity level to be achieved at training convergence, while maintaining negligible network accuracy degradation, or even improvement in the case of redundant parameters. Our method efficiently computes the WGSEF regularizer and its proximal operator, with worst-case linear complexity relative to the number of group variables. To train the model, we employ a proximal-gradient-based optimization technique that tackles the non-convex minimization problem incorporating the neural network loss and the WGSEF. Finally, we experimentally illustrate the efficiency of our proposed method in terms of compression ratio, accuracy, and inference latency.

URL: https://openreview.net/forum?id=XPLXYr7NlR

---

Title: I-ASIDE: Interpreting Global Perturbation Robustness through the Lens of Axiomatic Spectral Importance Decomposition

Abstract: Understanding model perturbation robustness mechanisms is critical for global interpretability. In this research, we present a model-agnostic interpretability method to interpret perturbation robustness mechanisms: Image Axiomatic Spectral Importance Decomposition Explanation (I-ASIDE). I-ASIDE aims to interpret model perturbation robustness mechanisms through the lens of the predictive powers of robust features and non-robust features within an information theory framework. This research is motivated by two key aspects. First, previous perturbation robustness metrics such as mean corruption errors (mCE) fall short in providing further interpretations regarding robustness mechanisms. Second, we notice that the spectral signal-to-noise ratios (SNR) of perturbed natural images exponentially decay over the frequency. This power-law-like decay implies that: low-frequency signals are generally more robust than high-frequency signals -- yet high classification accuracy can not be achieved by low-frequency signals alone. By deploying Shapley value theory, we quantify the predictive powers of robust features and non-robust features in decisions with an axiomatic approach. Our method provides a unique insight into model robustness mechanisms within an information theory framework. We conduct extensive experiments over a variety of vision foundation models to show that I-ASIDE can not only measure the perturbation robustness but also provide interpretations of its mechanisms.

URL: https://openreview.net/forum?id=uQYomAuo7M

---

Title: Graph Cuts with Arbitrary Size Constraints Through Optimal Transport

Abstract: A common way of partitioning graphs is through minimum cuts. One drawback of classical minimum cut methods is that they tend to produce small groups, which is why more balanced variants such as normalized and ratio cuts have seen more success. However, we believe that with these variants the balance constraints can be too restrictive for some applications, such as clustering imbalanced datasets, while not being restrictive enough when searching for perfectly balanced partitions. Here, we propose a new graph cut algorithm for partitioning graphs under arbitrary size constraints. We formulate the graph cut problem as a Gromov-Wasserstein problem with a concave regularizer. We then propose to solve it using an accelerated proximal gradient descent algorithm, which has global convergence guarantees, yields sparse solutions, and incurs only an additional $\mathcal{O}(\log(n))$ factor compared to the classical spectral clustering algorithm, while being observed to be more efficient.

URL: https://openreview.net/forum?id=UG7rtrsuaT

---

Title: Deconfounding Imitation Learning with Variational Inference

Abstract: Standard imitation learning can fail when the expert demonstrators have different sensory inputs than the imitating agent. This is because partial observability gives rise to hidden confounders in the causal graph. In previous work, to work around the confounding problem, policies have been trained using query access to the expert’s policy or inverse reinforcement learning (IRL). However, both approaches have drawbacks as the expert’s policy may not be available and IRL can be unstable in practice. Instead, we propose to train a variational inference model to infer the expert’s latent information and use it to train a latent-conditional policy. We prove that using this method, under strong assumptions, the identification of the correct imitation learning policy is theoretically possible from expert demonstrations alone. In practice, we focus on a setting with less strong assumptions where we use exploration data for learning the inference model. We show in theory and practice that this algorithm converges to the correct interventional policy, solves the confounding issue, and can under certain assumptions achieve an asymptotically optimal imitation performance.

URL: https://openreview.net/forum?id=3FsVtsISHW

---

Title: A replica analysis of under-bagging

Abstract: Under-bagging (UB), which combines under-sampling and bagging, is a popular ensemble learning method for training classifiers on imbalanced data. Using bagging to reduce the increased variance caused by the reduction in sample size due to under-sampling is a natural approach. However, it has recently been pointed out that in generalized linear models, naive bagging, which does not consider the class imbalance structure, and ridge regularization can produce the same results. Therefore, it is not obvious whether it is better to use UB, which requires an increased computational cost proportional to the number of under-sampled data sets, when training linear models. Given such a situation, in this study we heuristically derive the sharp asymptotics of UB and use it to compare with several other standard methods for learning from imbalanced data, in the scenario where a linear classifier is trained on two-component mixture data. The methods compared include the under-sampling (US) method, which trains a model using a single realization of the subsampled data, and the simple weighting (SW) method, which trains a model with a weighted loss on the entire data. It is shown that the performance of UB improves as the size of the majority class increases while the size of the minority class is kept fixed, even when the class imbalance is large, especially when the size of the minority class is small. This is in contrast to US, whose performance does not change as the size of the majority class increases, and SW, whose performance decreases as the imbalance increases. These results differ from the case of naive bagging when training generalized linear models without considering the structure of the class imbalance, indicating an intrinsic difference between ensembling and direct regularization of the parameters.
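
For concreteness, a generic under-bagging baseline on binary labels (not the paper's asymptotic analysis setup); the choice of logistic regression and the ensemble size are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def under_bagging(X, y, n_models=10, seed=0):
    """Repeatedly subsample the majority class down to the minority size,
    fit a linear classifier on each balanced subset, and return the ensemble.
    Predict by averaging predicted probabilities over the returned models."""
    rng = np.random.default_rng(seed)
    minority, majority = (0, 1) if (y == 0).sum() < (y == 1).sum() else (1, 0)
    idx_min, idx_maj = np.where(y == minority)[0], np.where(y == majority)[0]
    models = []
    for _ in range(n_models):
        sub = rng.choice(idx_maj, size=len(idx_min), replace=False)
        idx = np.concatenate([idx_min, sub])
        models.append(LogisticRegression().fit(X[idx], y[idx]))
    return models
```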

URL: https://openreview.net/forum?id=7HIOUZAoq5

---

Title: CountCLIP - [Re] Teaching CLIP to Count to Ten

Abstract: Large vision-language models (VLMs) are shown to learn rich joint image-text representations, enabling high performance on relevant downstream tasks. However, they fail to showcase their quantitative understanding of objects and lack a good counting-aware representation. This paper conducts a reproducibility study of ‘Teaching CLIP to Count to Ten’ (Paiss et al., 2023), which presents a method to finetune a CLIP model (Radford et al., 2021) to improve zero-shot counting accuracy in an image while maintaining the performance for zero-shot classification by introducing a counting-contrastive loss term. We contribute to the existing methods by improving the model’s performance on a smaller subset of their training data with lower computational resources. We verify these claims by reproducing their study with our own open-source code. The implementation can be found at https://anonymous.4open.science/r/CountCLIP-FA07.

URL: https://openreview.net/forum?id=BLnqs4jro9

---

Title: Revisiting Deep Feature Reconstruction for Logical and Structural Industrial Anomaly Detection

Abstract: Industrial anomaly detection is crucial for quality control and predictive maintenance but is challenging due to limited training data, varied anomaly types, and changing external factors affecting object appearances. Existing methods detect structural anomalies, such as dents and scratches, by relying on multi-scale features of image patches extracted from a deep pre-trained network. Nonetheless, extensive memory or compute requirements hinder their adoption in practice. Furthermore, detecting logical anomalies, such as images with missing or surplus elements, necessitates understanding spatial relationships beyond traditional patch-based methods. Our work focuses on Deep Feature Reconstruction (DFR), which offers a memory- and compute-efficient way of detecting structural anomalies. Moreover, we extend DFR to develop a unified framework for detecting structural and logical anomalies, called ULSAD. Specifically, we improve the training objective of DFR to enhance the capability to detect structural anomalies and introduce an attention-based loss using a global autoencoder-like network for detecting logical anomalies. Empirical results on five benchmark datasets demonstrate the effectiveness of ULSAD in the detection and localization of both structural and logical anomalies compared to eight state-of-the-art approaches. Moreover, an in-depth ablation study showcases the importance of each component in enhancing overall performance. Our code can be accessed here: https://anonymous.4open.science/r/ULSAD-2024.

URL: https://openreview.net/forum?id=kdTC4ktHPD

---

Title: Fair GANs through model rebalancing for extremely imbalanced class distributions

Abstract: Deep generative models require large amounts of training data. This often poses a problem as the collection of datasets can be expensive and difficult, in particular datasets that are representative of the appropriate underlying distribution (e.g. demographic). This introduces biases in datasets which are further propagated in the models. We present an approach to construct an unbiased generative adversarial network (GAN) from an existing biased GAN by rebalancing the model distribution. We do so by generating balanced data from an existing imbalanced deep generative model using an evolutionary algorithm and then using this data to train a balanced generative model. Additionally, we propose a bias mitigation loss function that minimizes the deviation of the learned class distribution from being equiprobable. We show results for the StyleGAN2 models while training on the Flickr Faces High Quality (FFHQ) dataset for racial fairness and see that the proposed approach improves on the fairness metric by almost 5 times, whilst maintaining image quality. We further validate our approach by applying it to an imbalanced CIFAR10 dataset where we show that we can obtain comparable fairness and image quality as when training on a balanced CIFAR10 dataset which is also twice as large. Lastly, we argue that the traditionally used image quality metrics such as Frechet inception distance (FID) are unsuitable for scenarios where the class distributions are imbalanced and a balanced reference set is not available.
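
A minimal sketch of a loss that penalizes deviation of the learned class distribution from equiprobable, here written as a KL divergence to the uniform distribution; the paper's exact formulation and how class frequencies of the generated samples are estimated may differ.

```python
import torch

def bias_mitigation_loss(class_probs):
    """class_probs: (K,) estimated class frequencies of generated samples,
    assumed to sum to one. Returns KL(class_probs || uniform)."""
    k = class_probs.numel()
    uniform = torch.full_like(class_probs, 1.0 / k)
    return torch.sum(class_probs * (torch.log(class_probs + 1e-12) - torch.log(uniform)))
```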

URL: https://openreview.net/forum?id=koNfV2yOtL

---

Title: Mitigating Simplicity Bias in Deep Learning for Improved OOD Generalization and Robustness

Abstract: Neural networks (NNs) are known to exhibit simplicity bias where they tend to prefer learning 'simple' features over more 'complex' ones, even when the latter may be more informative. Simplicity bias can lead to the model making biased predictions which have poor out-of-distribution (OOD) generalization and subgroup robustness. To address this, we propose a hypothesis about spurious features that directly connects to simplicity bias: we hypothesize that spurious features on many datasets are simple features that are still predictive of the label. We empirically validate this hypothesis, and subsequently develop a framework which leverages this hypothesis to learn more robust models. In our proposed framework, we first train a simple model, and then regularize the conditional mutual information with respect to it to obtain the final model. We theoretically study the effect of this regularization and show that it provably reduces reliance on spurious features in certain settings. We also empirically demonstrate the effectiveness of this framework in various problem settings and real-world applications, showing that it effectively addresses simplicity bias and leads to more features being used, enhances OOD generalization, and improves subgroup robustness and fairness.

URL: https://openreview.net/forum?id=XccFHGakyU

---

Title: Beyond Text: Utilizing Vocal Cues to Improve Decision Making in LLMs for Robot Navigation Tasks

Abstract: While LLMs excel at processing the text of human conversations, they struggle with the nuances of verbal instructions in scenarios like social navigation, where ambiguity and uncertainty can erode trust in robotic and other AI systems. We can address this shortcoming by moving beyond text and additionally focusing on the paralinguistic features of these audio responses. These features are the aspects of spoken communication that do not involve the literal wording (lexical content) but convey meaning and nuance through how something is said. We present ``Beyond Text'', an approach that improves LLM decision-making by integrating audio transcriptions along with a subset of these features that focus on affect and are most relevant in human-robot conversations. This approach not only achieves a 70.26% winning rate, outperforming existing LLMs by 22.16% to 48.30% (gemini-1.5-pro and gpt-3.5 respectively), but also enhances robustness against token-manipulation adversarial attacks, with a 22.44% smaller drop in winning rate than the text-only language model. ``Beyond Text'' marks an advancement in social robot navigation and broader human-robot interaction, seamlessly integrating text-based guidance with human-audio-informed language models.

URL: https://openreview.net/forum?id=ojWtq4n7Ag

---

Title: Attribute Graphs Underlying Generative Models: Path to Learning with Limited Data

Abstract: Training generative models that capture rich semantics of the data and interpreting the latent representations encoded by such models are very important problems in un-/self-supervised learning. In this work, we provide a simple algorithm that relies on perturbation experiments on latent codes of a pre-trained generative autoencoder to uncover an attribute graph that is implied by the generative model. We perform perturbation experiments to check for influence of a given latent variable on a subset of attributes. Given this, we show that one can fit an effective graphical model that models a structural equation model between latent codes taken as exogenous variables and attributes taken as observed variables. One interesting aspect is that a single latent variable controls multiple overlapping subsets of attributes unlike conventional approaches that try to impose full independence. Using a pre-trained generative autoencoder trained on a large dataset of small molecules, we demonstrate that the graphical model between various molecular attributes and latent codes learned by our algorithm can be used to predict a specific property for molecules which are drawn from a different distribution. We compare prediction models trained on various feature subsets chosen by simple baselines, as well as existing causal discovery and sparse learning/feature selection methods, with the ones in the derived Markov blanket from our method. Results show empirically that the predictor that relies on our Markov blanket attributes is robust to distribution shifts when transferred or fine-tuned with a few samples from the new distribution, especially when training data is limited.

URL: https://openreview.net/forum?id=APON4bslQC

---

Title: GraphMaker: Can Diffusion Models Generate Large Attributed Graphs?

Abstract: Large-scale graphs with node attributes are increasingly common in various real-world applications. Creating synthetic, attribute-rich graphs that mirror real-world examples is crucial, especially for sharing graph data for analysis and developing learning models when original data is restricted to be shared. Traditional graph generation methods are limited in their capacity to handle these complex structures. Recent advances in diffusion models have shown potential in generating graph structures without attributes and smaller molecular graphs. However, these models face challenges in generating large attributed graphs due to the complex attribute-structure correlations and the large size of these graphs. This paper introduces a novel diffusion model, GraphMaker, specifically designed for generating large attributed graphs. We explore various combinations of node attribute and graph structure generation processes, finding that an asynchronous approach more effectively captures the intricate attribute-structure correlations. We also address scalability issues through edge mini-batching generation. To demonstrate the practicality of our approach in graph data dissemination, we introduce a new evaluation pipeline. The evaluation demonstrates that synthetic graphs generated by GraphMaker can be used to develop competitive graph machine learning models for the tasks defined over the original graphs without actually accessing these graphs, while many leading graph generation methods fall short in this evaluation.

URL: https://openreview.net/forum?id=0q4zjGMKoA

---

Title: Boosting Unsupervised Semantic Segmentation with Principal Mask Proposals

Abstract: Unsupervised semantic segmentation aims to automatically partition images into semantically meaningful regions by identifying global categories within an image corpus without any form of annotation. Building upon recent advances in self-supervised representation learning, we focus on how to leverage these large pre-trained models for the downstream task of unsupervised segmentation. We present PriMaPs – Principal Mask Proposals – decomposing images into semantically meaningful masks based on their feature representation. This allows us to realize unsupervised semantic segmentation by fitting class prototypes to PriMaPs with a stochastic expectation-maximization algorithm, PriMaPs-EM. Despite its conceptual simplicity, PriMaPs-EM leads to competitive results across various pre-trained backbone models, including DINO and DINOv2, and across datasets, such as Cityscapes, COCO-Stuff, and Potsdam-3. Importantly, PriMaPs-EM is able to boost results when applied orthogonally to current state-of-the-art unsupervised semantic segmentation pipelines.

URL: https://openreview.net/forum?id=UawaTQzfwy

---

Title: Re-Thinking Inverse Graphics With Large Language Models

Abstract: Inverse graphics -- the task of inverting an image into physical variables that, when rendered, enable reproduction of the observed scene -- is a fundamental challenge in computer vision and graphics. Disentangling an image into its constituent elements, such as the shape, color, and material properties of the objects of the 3D scene that produced it, requires a comprehensive understanding of the environment. This requirement limits the ability of existing carefully engineered approaches to generalize across domains. Inspired by the zero-shot ability of large language models (LLMs) to generalize to novel contexts, we investigate the possibility of leveraging the broad world knowledge encoded in such models in solving inverse-graphics problems. To this end, we propose the Inverse-Graphics Large Language Model (IG-LLM), an inverse-graphics framework centered around an LLM, that autoregressively decodes a visual embedding into a structured, compositional 3D-scene representation. We incorporate a frozen pre-trained visual encoder and a continuous numeric head to enable end-to-end training. Through our investigation, we demonstrate the potential of LLMs to facilitate inverse graphics through next-token prediction, without the use of image-space supervision. Our analysis opens up new possibilities for precise spatial reasoning about images that exploit the visual knowledge of LLMs. We will release our code and data to ensure the reproducibility of our investigation and to facilitate future research.

URL: https://openreview.net/forum?id=u0eiu1MTS7

---

Title: Gaussian-Smoothed Sliced Probability Divergences

Abstract: The Gaussian-smoothed sliced Wasserstein distance has recently been introduced for comparing probability distributions while preserving privacy of the data. It has been shown to provide performance similar to its non-smoothed (non-private) counterpart. However, the computational and statistical properties of such a metric have not yet been well established. This work investigates the theoretical properties of this distance as well as those of its generalized versions, the Gaussian-smoothed sliced divergences. We first show that smoothing and slicing preserve the metric property and the weak topology. To study the sample complexity of such divergences, we then introduce the double empirical distribution $\hat{\hat\mu}_{n}$ for the smoothed and projected $\mu$. The distribution $\hat{\hat\mu}_{n}$ results from a double sampling process: one according to the original distribution $\mu$ and the second according to the convolution of the projection of $\mu$ onto the unit sphere with the Gaussian smoothing. We particularly focus on the Gaussian-smoothed sliced Wasserstein distance and prove that it converges at a rate $O(n^{-1/2})$. We also derive other properties, including continuity, of different divergences with respect to the smoothing parameter. We support our theoretical findings with empirical studies in the context of privacy-preserving domain adaptation.
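
For intuition, a Monte-Carlo sketch of the Gaussian-smoothed sliced Wasserstein distance between two equal-size samples: project onto random directions, smooth the projections with Gaussian noise, and average one-dimensional Wasserstein distances. The number of projections and the noise handling are simplified assumptions.

```python
import numpy as np

def gaussian_smoothed_sliced_w(X, Y, sigma=0.5, n_proj=50, p=2, seed=0):
    """Monte-Carlo estimate; assumes X and Y have the same number of samples,
    so the 1D Wasserstein-p distance reduces to comparing sorted projections."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    total = 0.0
    for _ in range(n_proj):
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)                       # random direction on the sphere
        x = np.sort(X @ theta + sigma * rng.normal(size=len(X)))  # smoothed projections
        y = np.sort(Y @ theta + sigma * rng.normal(size=len(Y)))
        total += np.mean(np.abs(x - y) ** p)
    return (total / n_proj) ** (1.0 / p)
```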

URL: https://openreview.net/forum?id=weuALLWUV2

---

Title: Policy optimization in reinforcement learning for column generation

Abstract: Column generation (CG) is essential for addressing large-scale linear integer programming problems in many industrial domains. While its importance is evident, CG algorithms face convergence issues, and several heuristic algorithms have been developed to address these challenges. However, few machine learning and reinforcement learning methods are available that enhance the existing CG algorithm. This paper introduces \textbf{PPO-CG}, a new policy optimization RL framework that improves on the existing DQN-based CG framework, particularly in training time. When applied to Cutting Stock Problems (CSP), our approach requires merely \textbf{20\%} of the training time observed with the DQN-based method, and only \textbf{35\%} on Vehicle Routing Problems with Time Windows (VRPTW). In addition, our approach suggests a novel treatment of the node selection problem in the framework of reinforcement learning on graphs.

URL: https://openreview.net/forum?id=Y3ReoM4NhO

---

Title: Efficient Action Robust Reinforcement Learning with Probabilistic Policy Execution Uncertainty

Abstract: Robust reinforcement learning (RL) aims to find a policy that optimizes the worst-case performance in the face of uncertainties. In this paper, we focus on action robust RL with the probabilistic policy execution uncertainty, in which, instead of always carrying out the action specified by the policy, the agent will take the action specified by the policy with probability $1-\rho$ and an alternative adversarial action with probability $\rho$. We show the existence of an optimal policy on the action robust MDPs with probabilistic policy execution uncertainty and provide the action robust Bellman optimality equation for its solution. Based on that, we develop Action Robust Reinforcement Learning with Certificates (ARRLC) algorithm that achieves minimax optimal regret and sample complexity. Furthermore, we conduct numerical experiments to validate our approach's robustness, demonstrating that ARRLC outperforms non-robust RL algorithms and converges faster than the other action robust RL algorithms in the presence of action perturbations.
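
A one-line sketch of the robust state value implied by this execution model for a tabular Q-table; the ARRLC algorithm itself, with its certificates and exploration bonuses, is not shown.

```python
import numpy as np

def action_robust_bellman_value(Q, rho):
    """Robust state value under probabilistic policy execution: with probability
    1 - rho the greedy action is taken, with probability rho an adversarial one.
    Q: array of shape (num_states, num_actions)."""
    return (1 - rho) * Q.max(axis=1) + rho * Q.min(axis=1)
```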

URL: https://openreview.net/forum?id=9sZsjfZV3q

---

Title: Pre-trained Hypergraph Convolutional Neural Networks with Self-supervised Learning

Abstract: Hypergraphs are powerful tools for modeling complex interactions across various domains, including biomedicine. However, learning meaningful node representations from hypergraphs remains a challenge. Existing supervised methods often lack generalizability, thereby limiting their real-world applications. We propose a new method, Pre-trained Hypergraph Convolutional Neural Networks with Self-supervised Learning (PhyGCN), which leverages hypergraph structure for self-supervision to enhance node representations. PhyGCN introduces a unique training strategy that integrates variable hyperedge sizes with self-supervised learning, enabling improved generalization to unseen data. Applications on multi-way chromatin interactions and polypharmacy side-effects demonstrate the effectiveness of PhyGCN. As a generic framework for high-order interaction datasets with abundant unlabeled data, PhyGCN holds strong potential for enhancing hypergraph node representations across various domains.

URL: https://openreview.net/forum?id=0VWXWPmctm

---

Title: Peer Rank and Discussion Improve Large Language Model based Evaluations

Abstract: Nowadays, the quality of responses generated by different modern large language models (LLMs) is hard to evaluate and compare automatically. Recent studies suggest and predominantly use LLMs for reference-free evaluation of open-ended question answering. More specifically, they use the recognized “strongest” LLM as the evaluator, which conducts pairwise comparisons of candidate models’ answers and provides a ranking score. However, this intuitive method has multiple problems, such as bringing in self-enhancement (favoring its own answers) and positional bias. We draw insights and lessons from the educational domain (Cho & MacArthur, 2011; Walsh, 2014) to improve LLM-based evaluations. Specifically, we propose (1) the peer rank (PR) algorithm that takes into account each peer LLM’s pairwise preferences of all answer pairs, and outputs a final ranking of models; and (2) peer discussion (PD), where we prompt two LLMs to discuss and try to reach a mutual agreement on the preferences of two answers. We conduct experiments on two benchmark datasets. We find that our approaches achieve higher accuracy and align better with human judgments. Interestingly, PR can induce a relatively accurate self-ranking of models under the anonymous setting, where each model’s name is unrevealed. Our work provides space to explore evaluating models that are hard to compare for humans.

URL: https://openreview.net/forum?id=YVD1QqWRaj

---

Title: Effective, Stable and Efficient Unsupervised Image Outlier Detection via Distance Ensemble Learning

Abstract: Automatically and efficiently identifying whether visual data contain outliers (anomalies) is an important research topic. Although there has been rapid progress in the efficacy of unsupervised image outlier detection, the instability and complexity of the state-of-the-art (SOTA) methods remain a notable challenge. In this work, we explain the instability problem as deriving from the mainstream single-method-fits-multiple-scenarios paradigm, which results in performance fluctuations across different target dataset domains and varying outlier ratios. Ensembling multiple methods therefore seems necessary. Nevertheless, traditional ensemble learners such as stacking and boosting are less effective without any supervision and are often time-consuming. Hence, we introduce a novel and lightweight distance ensemble learning (DEL) framework featuring self-selection strategies over a series of distance-based methods. Specifically, by exploring a specific property of the high-dimensional space, we propose the normalized Euclidean distance relative to the mean of the target dataset as a reliable baseline. Building upon this baseline method, we enhance it with a conditional bilateral distance metric to achieve stability across diverse dataset domains at low outlier ratios. Furthermore, to address the mean-shift problem encountered by the advanced baseline at high outlier ratios, we integrate it with a high-ratio-specific distance transformer, called Shell-Re. This integration effectively mitigates the advanced baseline's instability across a wide range of outlier ratios. Overall, our approach achieves SOTA results on various challenging benchmarks while offering inference speeds that are orders of magnitude faster.

URL: https://openreview.net/forum?id=6wKI8IISgn

---

Title: Discffusion: Discriminative Diffusion Models as Few-shot Vision and Language Learners

Abstract: Diffusion models, such as Stable Diffusion, have shown incredible performance on text-to-image generation. Since text-to-image generation often requires models to generate visual concepts with fine-grained details and attributes specified in text prompts, can we leverage the powerful representations learned by pre-trained diffusion models for discriminative tasks such as image-text matching? To answer this question, we propose a novel approach, Discriminative Stable Diffusion (Discffusion), which turns pre-trained text-to-image diffusion models into few-shot discriminative learners. Our approach uses the cross-attention score of a Stable Diffusion model to capture the mutual influence between visual and textual information and fine-tune the model via a new attention-based prompt learning to perform image-text matching. By comparing Discffusion with state-of-the-art methods on several benchmark datasets, we demonstrate the potential of using pre-trained diffusion models for discriminative tasks with superior results on few-shot image-text matching.

URL: https://openreview.net/forum?id=GtnipgAomT

---

Title: Mechanistic Interpretability for AI Safety - A Review

Abstract: As artificial intelligence (AI) systems rapidly advance, understanding their inner workings is crucial for ensuring alignment with human values and safety. This review explores mechanistic interpretability, which aims to reverse-engineer the computational mechanisms and representations learned by neural networks into human-understandable algorithms and concepts, focusing on a granular, causal understanding of how AI models operate.
We establish foundational concepts, including features as units encoding knowledge within neural activations and hypotheses surrounding their representation and computation. We survey methodologies for causally dissecting model behaviors and assess the relevance of mechanistic interpretability to AI safety. We examine benefits in understanding, control, and alignment while discussing risks like capability gains and dual-use concerns.
We examine the challenges of scalability, automation, and comprehensive understanding. We advocate for future work clarifying core concepts, setting rigorous standards, scaling up techniques to handle complex models and behaviors, and expanding the scope to domains like vision and reinforcement learning.

URL: https://openreview.net/forum?id=ePUVetPKu6

---

Title: SpikeGPT: Generative Pre-trained Language Model with Spiking Neural Networks

Abstract: As the size of large language models continues to scale, so do the computational resources required to run them. Spiking Neural Networks (SNNs) have emerged as an energy-efficient approach to deep learning that leverages sparse and event-driven activations to reduce the computational overhead associated with model inference. While they have become competitive with non-spiking models on many computer vision tasks, SNNs have proven more challenging to train. As a result, their performance lags behind modern deep learning, and until now, SNNs have yet to succeed at language generation on large-scale datasets. In this paper, inspired by the Receptance Weighted Key Value (RWKV) language model, we successfully implement `SpikeGPT', a generative language model with binary, event-driven spiking activation units. We train two variants of the proposed model, with 46M and 216M parameters. To the best of our knowledge, SpikeGPT was the largest backpropagation-trained SNN model at the time of its release, rendering it suitable for both the generation and comprehension of natural language. We achieve this by modifying the transformer block, replacing multi-head self-attention so that the quadratic computational complexity $\mathcal{O}(T^2)$ in sequence length is reduced to linear complexity $\mathcal{O}(T)$. Input tokens are instead streamed in sequentially to our attention mechanism (as with typical SNNs). Our experiments show that SpikeGPT remains competitive with non-spiking models on the tested benchmarks while requiring 32.2$\times$ fewer operations when processed on neuromorphic hardware that can leverage sparse, event-driven activations.
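
For readers unfamiliar with how backpropagation can train binary spiking units at all, the sketch below shows a standard surrogate-gradient trick in PyTorch. It illustrates the generic SNN training technique only, not SpikeGPT's specific RWKV-based block; the surrogate slope is an arbitrary choice.

    import torch

    class SpikeSurrogate(torch.autograd.Function):
        """Heaviside spike in the forward pass; sigmoid-derivative surrogate
        gradient in the backward pass (a common trick for training SNNs)."""

        @staticmethod
        def forward(ctx, membrane_potential):
            ctx.save_for_backward(membrane_potential)
            return (membrane_potential > 0).float()   # binary, event-driven output

        @staticmethod
        def backward(ctx, grad_output):
            (u,) = ctx.saved_tensors
            sig = torch.sigmoid(4.0 * u)              # assumed surrogate slope of 4
            return grad_output * 4.0 * sig * (1 - sig)

    u = torch.randn(2, 8, requires_grad=True)         # toy membrane potentials
    spikes = SpikeSurrogate.apply(u)
    spikes.sum().backward()                           # gradients flow despite the step function
    print(spikes, u.grad)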

URL: https://openreview.net/forum?id=gcf1anBL9e

---

Title: QGen: On the Ability to Generalize in Quantization Aware Training

Abstract: Quantization lowers memory usage, computational requirements, and latency by utilizing fewer bits to represent model weights and activations. In this work, we investigate the generalization properties of quantized neural networks, a characteristic that has received little attention despite its implications for model performance. In particular, we first develop a theoretical model for quantization in neural networks and demonstrate how quantization functions as a form of regularization. Second, motivated by recent work connecting the sharpness of the loss landscape and generalization, we derive an approximate bound for the generalization of quantized models conditioned on the amount of quantization noise. We then validate our hypothesis by experimenting with over 2000 convolutional and transformer-based models trained on the CIFAR-10, CIFAR-100, and ImageNet datasets.
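
To see what "quantization noise" refers to, the sketch below applies generic symmetric uniform (fake) quantization to a weight tensor; the gap between quantized and full-precision weights is the noise the bound is conditioned on. The bit-widths and rounding scheme are assumptions, not the paper's exact formulation.

    import numpy as np

    def quantize_uniform(w: np.ndarray, bits: int = 4) -> np.ndarray:
        """Symmetric uniform quantization of a weight tensor to `bits` bits."""
        levels = 2 ** (bits - 1) - 1
        scale = np.max(np.abs(w)) / levels
        return np.round(w / scale) * scale

    rng = np.random.default_rng(0)
    w = rng.normal(size=(256, 256)).astype(np.float32)
    for bits in (8, 4, 2):
        noise = quantize_uniform(w, bits) - w      # quantization noise
        print(f"{bits}-bit quantization noise std: {noise.std():.4f}")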

URL: https://openreview.net/forum?id=xAgfPcZJNL

---

Title: PCNN: Probable-Class Nearest-Neighbor Explanations Improve Fine-Grained Image Classification Accuracy for AIs and Humans

Abstract: Nearest neighbors (NN) are traditionally used to compute final decisions, e.g., in Support Vector Machines or k-NN classifiers, and to provide users with explanations for the model's decision. In this paper, we show a novel utility of nearest neighbors: improving the predictions of a frozen, pretrained classifier C. We leverage an image comparator S that (1) compares the input image with NN images from the top-K most probable classes, and (2) produces output scores that are used to weight the confidence scores of C. Our method consistently improves fine-grained image classification accuracy on CUB-200, Cars-196, and Dogs-120. In addition, a human study finds that showing lay users our probable-class nearest neighbors (PCNN) improves their decision accuracy over prior work, which shows only the top-1 class examples.
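
A minimal sketch of the re-weighting idea follows. The multiplicative combination, the function name, and the comparator scores are assumptions for illustration; the paper's exact weighting scheme may differ.

    import numpy as np

    def pcnn_rerank(probs_c: np.ndarray, comparator_scores: dict, k: int = 5) -> np.ndarray:
        """Re-weight a frozen classifier's softmax scores with image-comparator
        scores over the top-K most probable classes.

        probs_c: (num_classes,) softmax output of the frozen classifier C.
        comparator_scores: {class_id: score in [0, 1]} from comparator S, obtained
            by comparing the input against nearest-neighbor images of that class.
        """
        reweighted = probs_c.copy()
        topk = np.argsort(-probs_c)[:k]
        for c in topk:
            reweighted[c] = probs_c[c] * comparator_scores.get(int(c), 1.0)
        return reweighted / reweighted.sum()

    probs = np.array([0.40, 0.35, 0.15, 0.07, 0.03])
    s_scores = {0: 0.3, 1: 0.9, 2: 0.5}          # hypothetical comparator outputs
    print(np.argmax(probs), np.argmax(pcnn_rerank(probs, s_scores, k=3)))   # 0 -> 1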

URL: https://openreview.net/forum?id=OcFjqiJ98b

---

Title: Automated and Unbiased Coefficient Clustering with Non Convex SLOPE

Abstract: This work studies the problem of sparse structured generalized linear models with sorted nonsmooth penalties, which are known to induce an automatic grouping of the features without a priori knowledge of the groups. Generalizing the Sorted L1 Penalty (SLOPE), we introduce a family of nonconvex sorted penalties which not only promote clustering of variables but are also less biased than their popular convex counterpart. For sorted weakly convex penalties (e.g. sorted MCP and SCAD), we provide an algorithm that exactly and efficiently computes their proximal operator. Moreover, we show that a slight modification of this algorithm is remarkably efficient for computing the proximal operator of sorted $\ell_q$ with $q \in \left]0,1\right[$, which is not weakly convex and whose proximal operator involves a challenging combinatorial problem. We demonstrate the benefits of such penalties in several experiments.
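
For background, the proximal operator of the convex SLOPE penalty, which these nonconvex sorted penalties generalize, can be computed by sorting, shifting, and an isotonic projection. The sketch below covers this classical convex case only, not the paper's new prox algorithms for sorted MCP/SCAD or sorted $\ell_q$.

    import numpy as np
    from sklearn.isotonic import IsotonicRegression

    def prox_slope(y: np.ndarray, lambdas: np.ndarray) -> np.ndarray:
        """Proximal operator of the convex sorted-L1 (SLOPE) penalty
        sum_i lambdas[i] * |x|_(i), with lambdas sorted in decreasing order."""
        sign, mag = np.sign(y), np.abs(y)
        order = np.argsort(-mag)                  # sort magnitudes, descending
        z = mag[order] - lambdas                  # shift by the sorted weights
        # Project onto the non-increasing cone, then clip at zero.
        iso = IsotonicRegression(increasing=False).fit_transform(np.arange(len(z)), z)
        x_sorted = np.maximum(iso, 0.0)
        x = np.empty_like(y)
        x[order] = x_sorted                       # undo the sort
        return sign * x

    y = np.array([3.0, -1.0, 0.5, 2.0])
    lam = np.array([1.5, 1.0, 0.5, 0.1])          # decreasing penalty sequence
    print(prox_slope(y, lam))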

URL: https://openreview.net/forum?id=9MJNEz0V13

---

Title: Unsupervised Domain Adaptation by Learning Using Privileged Information

Abstract: Successful unsupervised domain adaptation is guaranteed only under strong assumptions such as covariate shift and overlap between input domains. The latter is often violated in high-dimensional applications like image classification which, despite this limitation, continues to serve as inspiration and benchmark for algorithm development. In this work, we show that training-time access to side information in the form of auxiliary variables can help relax restrictions on input variables and increase the sample efficiency of learning at the cost of collecting a richer variable set. As this information is assumed available only during training, not in deployment, we call this problem unsupervised domain adaptation by learning using privileged information (DALUPI). To solve this problem, we propose a simple two-stage learning algorithm, inspired by our analysis of the expected error in the target domain, and a practical end-to-end variant for image classification. We propose three evaluation tasks based on classification of entities in photos and anomalies in medical images with different types of available privileged information (binary attributes and single or multiple regions of interest). We demonstrate across these tasks that using privileged information in learning can reduce errors in domain transfer compared to baselines, be robust to spurious correlations in the source domain, and increase sample efficiency.

URL: https://openreview.net/forum?id=saV3MPH0kw

---

Title: Read Between the Layers: Leveraging Intra-Layer Representations for Rehearsal-Free Continual Learning with Pre-Trained Models

Abstract: We address the Continual Learning (CL) problem, wherein a model must learn a sequence of tasks from non-stationary distributions while preserving prior knowledge upon encountering new experiences. With the advancement of foundation models, CL research has pivoted from the initial learning-from-scratch paradigm towards utilizing generic features from large-scale pre-training. However, existing approaches to CL with pre-trained models primarily focus on separating class-specific features from the final representation layer and neglect the potential of intermediate representations to capture low- and mid-level features, which are more invariant to domain shifts. In this work, we propose LayUP, a new prototype-based approach to continual learning that leverages second-order feature statistics from multiple intermediate layers of a pre-trained network. Our method is conceptually simple, does not require access to prior data, and works out of the box with any foundation model. LayUP surpasses the state of the art in four of the seven class-incremental learning benchmarks, all three domain-incremental learning benchmarks and in six of the seven online continual learning benchmarks, while significantly reducing memory and computational requirements compared to existing baselines. Our results demonstrate that fully exhausting the representational capacities of pre-trained models in CL goes well beyond their final embeddings. The code will be made publicly available upon acceptance.

URL: https://openreview.net/forum?id=ZTcxp9xYr2

---

Title: Multivariate Dense Retrieval: A Reproducibility Study

Abstract: The current paradigm in dense retrieval is to represent queries and passages as low-dimensional real-valued vectors using neural language models, and then compute query-passage similarity as the dot product of these vector representations. A limitation of this approach is that these learned representations cannot capture or express uncertainty. At the same time, information retrieval over large corpora contains several sources of uncertainty, such as misspelled or ambiguous text. Consequently, retrieval methods that incorporate uncertainty estimation are more likely to generalize well to such data distribution shifts. The multivariate representation learning (MRL) framework proposed by Zamani & Bendersky (2023) is the first method that works in the direction of modeling uncertainty in dense retrieval. This framework represents queries and passages as multivariate normal distributions, and computes query-passage similarity as the negative Kullback-Leibler (KL) divergence between these distributions. Furthermore, MRL formulates KL divergence as a dot product, allowing for efficient first-stage retrieval using standard maximum inner product search.

In this paper, we attempt to reproduce the MRL framework for dense retrieval by Zamani & Bendersky (2023).
We find that the original work (i) introduces a mathematical error early in the formulation of the method that propagates to the rest of the original paper's mathematical formulations, (ii) does not provide all of the necessary information to facilitate reproducibility, and (iii) proposes a training setup to train MRL that, if followed, does not yield the reported performance. In light of the aforementioned, we correct the mathematical error, make some reasonable design choices, and propose an improved training setup that complements the original paper by filling in important details that were unspecified. We further contribute a thorough ablation study which is absent from the original paper, to gain more insight into the impact of the framework's different components. Despite our efforts, we were neither able to reproduce the exact results reported in the original paper, nor to uncover the reported trends against the baselines. Our analysis offers insights as to why that is the case. Most importantly, our empirical results suggest that the definition of variance in MRL does not consistently capture uncertainty. The source code for our reproducibility study is available at: https://anonymous.4open.science/r/multivariate_ir_code_release-AB26.
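
To make the similarity computation concrete, the sketch below scores a query-document pair by the negative KL divergence between two diagonal-covariance Gaussians. This is only a sketch of the MRL idea; the framework's rewriting of the KL divergence as a dot product for maximum inner product search is not reproduced here, and the toy parameters are assumptions.

    import numpy as np

    def neg_kl_similarity(mu_q, var_q, mu_d, var_d):
        """Negative KL( N(mu_q, diag(var_q)) || N(mu_d, diag(var_d)) ) used as a
        query-document similarity."""
        kl = 0.5 * np.sum(var_q / var_d
                          + (mu_d - mu_q) ** 2 / var_d
                          - 1.0
                          + np.log(var_d / var_q))
        return -kl

    mu_q, var_q = np.array([0.2, -0.1]), np.array([0.5, 0.3])
    mu_d, var_d = np.array([0.25, -0.05]), np.array([0.6, 0.4])
    print(neg_kl_similarity(mu_q, var_q, mu_d, var_d))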

URL: https://openreview.net/forum?id=1u9WOhpISC

---

Title: InvariantStock: Learning Invariant Features for Mastering the Shifting Market

Abstract: Accurately predicting stock returns is crucial for effective portfolio management. However, existing methods often overlook a fundamental issue in the market, namely, distribution shifts, making them less practical for predicting future markets or newly listed stocks. This study introduces a novel approach to address this challenge by focusing on the acquisition of invariant features across various environments, thereby enhancing robustness against distribution shifts. Specifically, we present InvariantStock, a learning framework comprising two key modules: an environment-aware prediction module and an environment-agnostic module. Through the learning of these two modules, the proposed method can learn invariant features across different environments in a straightforward manner, significantly improving its ability to handle distribution shifts in diverse market settings. Our results demonstrate that the proposed InvariantStock not only delivers robust and accurate predictions but also outperforms existing baseline methods in both prediction tasks and backtesting within the dynamically changing markets of China and the United States.

URL: https://openreview.net/forum?id=dtNEvUOZmA

---

Title: Towards Understanding Variants of Invariant Risk Minimization through the Lens of Calibration

Abstract: Machine learning models traditionally assume that training and test data are independently and identically distributed. However, in real-world applications, the test distribution often differs from training. This problem, known as out-of-distribution (OOD) generalization, challenges conventional models. Invariant Risk Minimization (IRM) emerges as a solution that aims to identify invariant features across different environments to enhance OOD robustness. However, IRM's complexity, particularly its bi-level optimization, has led to the development of various approximate methods. Our study investigates these approximate IRM techniques, employing the Expected Calibration Error (ECE) as a key metric to measure the extent to which the model has acquired invariant features. ECE, which measures the reliability of model predictions, serves as an indicator of whether models effectively capture environment-invariant features. Through a comparative analysis of datasets with distributional shifts, we observe that Information Bottleneck-based IRM, which condenses representational information, achieves a balance, improving ECE while largely preserving accuracy. This finding is pivotal, demonstrating a feasible path to maintaining robustness without compromising accuracy. Nonetheless, our experiments also caution against over-regularization, which can diminish accuracy. This underscores the necessity for a systematic approach in evaluating OOD generalization metrics, one that goes beyond mere accuracy to address the nuanced interplay between accuracy and calibration.
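
Since ECE is the central metric here, a standard computation is sketched below (equal-width confidence bins are assumed; the paper's binning choices may differ).

    import numpy as np

    def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
        """Standard ECE: bin predictions by confidence and average the gap between
        mean confidence and accuracy per bin, weighted by the bin size."""
        confidences = np.asarray(confidences)
        correct = np.asarray(correct, dtype=float)
        bins = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for lo, hi in zip(bins[:-1], bins[1:]):
            mask = (confidences > lo) & (confidences <= hi)
            if mask.any():
                ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
        return ece

    conf = np.array([0.95, 0.9, 0.8, 0.7, 0.6, 0.55])   # toy predicted confidences
    hit = np.array([1, 1, 0, 1, 0, 1])                  # whether each prediction was correct
    print(expected_calibration_error(conf, hit))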

URL: https://openreview.net/forum?id=9YqacugDER

---

Title: Complex-based Ligand-Binding Proteins Redesign by Equivariant Diffusion-based Generative Models

Abstract: Proteins, serving as the fundamental architects of biological processes, interact with ligands to perform a myriad of functions essential for life. The design and optimization of ligand-binding proteins are pivotal for advancing drug development and enhancing therapeutic efficacy. In this study, we introduce ProteinReDiff, a novel computational framework designed to revolutionize the redesign of ligand-binding proteins. Distinguished by its utilization of Equivariant Diffusion-based Generative Models and advanced computational modules, ProteinReDiff enables the creation of high-affinity ligand-binding proteins without the need for detailed structural information, leveraging instead the potential of initial protein sequences and ligand SMILES strings. Our thorough evaluation across sequence diversity, structural preservation, and ligand binding affinity underscores ProteinReDiff's potential to significantly advance computational drug discovery and protein engineering. We will release our data and source code upon acceptance.

URL: https://openreview.net/forum?id=oZrBmyFKmr

---

Title: SFT: Sampling-based Foundational Transformer

Abstract: The extraordinary success of transformers as a sequence processing model is hindered by two factors: the quadratic complexity of self-attention modules and the difficulty of transformer training. In this paper, we introduce two mechanisms aiming to alleviate these problems: a novel neural-guided down-sampling for self-attention and a new attention non-linearity with linear-scaled and convex characteristics. These two mechanisms not only speed up the self-attention computation but also greatly ease the pain of meticulous hyper-parameter tuning. Moreover, our relative positional encoding procedure applies to many types of data structures as well as special constraints, such as rotational invariance (i.e. for 3D point clouds). It is important to emphasize that our model is a foundation model that can work with multiple types of data structures including point clouds, graphs, and long-range sequences. As a foundation model, we achieve competitive results on many data structures against specialized models in standard benchmarks, while being faster and more efficient in inference than other state-of-the-art baselines. We release our source code in the supplementary materials.

URL: https://openreview.net/forum?id=m4eD6HDGGX

---

Title: Bayesian optimization with derivatives acceleration

Abstract: Bayesian optimization algorithms form an important class of methods to minimize functions that are costly to evaluate, which is a very common situation. These algorithms iteratively infer Gaussian processes from past observations of the function and decide where new observations should be made through the maximization of an acquisition criterion. Often, the objective function is defined on a compact set such as a hyper-rectangle in $d$-dimensional real space, and the bounds are chosen wide enough so that the optimum lies inside the search domain. In this situation, this work provides a way to integrate into the acquisition criterion the a priori information that these functions, once modeled as GP trajectories, should be evaluated at their minima, and not at arbitrary points as usual acquisition criteria allow. We propose an adaptation of the widely used Expected Improvement acquisition criterion that accounts only for GP trajectories whose first-order partial derivatives are zero and whose Hessian matrix is positive definite. The new acquisition criterion retains an analytical, computationally efficient expression. It is found to improve Bayesian optimization on a test bed of functions made of Gaussian process trajectories in dimensions 2, 3 and 5. The addition of first- and second-order derivative information is particularly useful for multimodal functions.
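
For reference, the classical Expected Improvement criterion that this work adapts has the closed form sketched below; the derivative-conditioned variant proposed in the paper is not shown, and the toy GP posterior values are assumptions.

    import numpy as np
    from scipy.stats import norm

    def expected_improvement(mu, sigma, f_best):
        """Classical Expected Improvement for minimization, given the GP posterior
        mean `mu` and standard deviation `sigma` at candidate points and the best
        observed value `f_best`."""
        sigma = np.maximum(sigma, 1e-12)
        z = (f_best - mu) / sigma
        return (f_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

    mu = np.array([0.2, -0.1, 0.4])       # toy GP posterior means at candidates
    sigma = np.array([0.3, 0.05, 0.5])    # toy GP posterior standard deviations
    print(expected_improvement(mu, sigma, f_best=0.0))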

URL: https://openreview.net/forum?id=JRjD0YF3Yd

---

Title: Top-GAP: Integrating Size Priors in CNNs for more Robustness, Interpretability, and Bias Mitigation

Abstract: In the field of computer vision, convolutional neural networks (CNNs) have shown remarkable capabilities and are excelling in various tasks from image classification to semantic segmentation. However, their vulnerability to adversarial attacks remains a pressing issue that limits their use in safety-critical domains. In this paper, we present Top-GAP -- a method that aims to increase the native robustness of CNNs by restricting the spatial size of feature representations. The advantage of our approach over common adversarial training is that our method does not degrade clean accuracy or training speed. On CIFAR-10 with PGD $\epsilon=8/255$ and $20$ iterations, we achieve over 50\% robust accuracy while retaining the original clean accuracy. Moreover, our size constraint helps to generate sparser and less noisy class activation maps, which significantly improves object localization and mitigates potential biases. We demonstrate on a variety of datasets and architectures that our method achieves clean accuracy comparable to regularly trained models while improving localization and robustness. In addition, our method provides the ability to incorporate prior human knowledge about object sizes into the network, which is particularly beneficial in biological and medical domains where the variance in object sizes is not dominated by perspective projections.

URL: https://openreview.net/forum?id=T58Zu9jmBP

---

Title: Deep Kernel Learning of Nonlinear Latent Force Models

Abstract: Scientific processes are often modelled by sets of differential equations. As datasets grow, individually fitting these models and quantifying their uncertainties becomes a computationally challenging task. Latent force models offer a mathematically-grounded balance between data-driven and mechanistic inference in such dynamical systems, whilst accounting for stochasticity in observations and parameters. However, the required derivation and computation of the posterior kernel terms over a low-dimensional latent force is rarely tractable, requiring approximations for complex scenarios such as nonlinear dynamics. In this paper, we overcome this issue by posing the problem as learning the solution operator itself to a class of latent force models, thereby improving the scalability of these models. This is achieved by employing a deep kernel along with a meta-learned embedding of the output functions. Finally, we demonstrate the ability to extrapolate a solution operator trained on simulations to real experimental datasets, as well as scaling to large datasets.

URL: https://openreview.net/forum?id=CNJIpI4Gb9

---

Title: LLM Agents can Autonomously Hack Websites

Abstract: In recent years, large language models (LLMs) have become increasingly capable and can now interact with tools (i.e., call functions), read documents, and recursively call themselves. As a result, these LLMs can now function autonomously as agents. With the rise in capabilities of these agents, recent work has speculated on how LLM agents would affect cybersecurity. However, not much is known about the offensive capabilities of LLM agents.

In this work, we show that LLM agents can autonomously hack websites, performing tasks as complex as blind database schema extraction and SQL injections without human feedback. Importantly, the agent does not need to know the vulnerability beforehand. This capability is uniquely enabled by frontier models that are highly capable of tool use and leveraging extended context. Namely, we show that GPT-4 is capable of such hacks, but existing open-source models are not. Finally, we show that GPT-4 is capable of autonomously finding vulnerabilities in websites in the wild. Our findings raise questions about the widespread deployment of LLMs.

URL: https://openreview.net/forum?id=6xubl2J2VP

---

Title: Learning to Solve Integer Linear Programs with Davis-Yin Splitting

Abstract: In many applications, a combinatorial problem must be repeatedly solved with similar, but distinct parameters. Yet, the parameters $w$ are not directly observed; only contextual data $d$ that correlates with $w$ is available. It is tempting to use a neural network to predict $w$ given $d$. However, training such a model requires reconciling the discrete nature of combinatorial optimization with the gradient-based frameworks used to train neural networks. When the problem in question is an Integer Linear Program (ILP), one approach to overcome this training issue is to consider a continuous relaxation of the combinatorial problem. While existing methods utilizing this approach have been shown to be highly effective on small problems, they do not always scale well to large problems. In this work, we draw on ideas from modern convex optimization to design a network and training scheme which scales effortlessly to problems with thousands of variables. Our experiments verify the computational advantage our proposed method enjoys on two representative problems, namely the shortest path problem and the knapsack problem.
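
For orientation, the underlying Davis-Yin three-operator splitting iteration (for minimizing f + g + h with h smooth) is sketched below on a toy problem. The learned, ILP-specific components of the paper are not shown; the toy objective and step size are assumptions.

    import numpy as np

    def davis_yin(prox_f, prox_g, grad_h, z0, step=0.5, iters=200):
        """Generic Davis-Yin splitting for min f(x) + g(x) + h(x), h smooth."""
        z = z0.copy()
        for _ in range(iters):
            x_f = prox_f(z, step)
            x_g = prox_g(2 * x_f - z - step * grad_h(x_f), step)
            z = z + x_g - x_f
        return x_f

    # Toy instance: min_x  ind_{[0,1]^n}(x) + ||x||_1 + 0.5 * ||x - c||^2
    c = np.array([1.5, -0.3, 0.4])
    prox_box = lambda v, t: np.clip(v, 0.0, 1.0)                        # f: box indicator
    prox_l1 = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)  # g: l1 norm
    grad_q = lambda x: x - c                                            # h: smooth quadratic
    print(davis_yin(prox_box, prox_l1, grad_q, np.zeros_like(c)))       # approx. [0.5, 0, 0]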

URL: https://openreview.net/forum?id=H8IaxrANWl

---

Title: Graph Harmony: Denoising and Nuclear-Norm Wasserstein Adaptation for Enhanced Domain Transfer in Graph-Structured Data

Abstract: Graph-structured data can be found in numerous domains, yet the scarcity of labeled instances hinders the effective use of deep learning in many scenarios. Traditional unsupervised domain adaptation (UDA) strategies for graphs primarily hinge on adversarial learning and pseudo-labeling. These approaches fail to effectively leverage graph discriminative features, leading to class mismatching and unreliable label quality. To address these obstacles, we developed the Denoising and Nuclear-Norm Wasserstein Adaptation Network (DNAN). DNAN employs the Nuclear-norm Wasserstein discrepancy (NWD), which can simultaneously achieve domain alignment and class distinction. It also integrates a denoising mechanism via a Variational Graph Autoencoder. This denoising mechanism helps capture essential features of both source and target domains, improving the robustness of the domain adaptation process. Our comprehensive experiments demonstrate that DNAN outperforms state-of-the-art methods on standard UDA benchmarks for graph classification.

URL: https://openreview.net/forum?id=CSv7GgKHb6

---

Title: Cross-domain Adaptation for Few-shot 3D Shape Generation

Abstract: Realistic and diverse 3D shape generation is helpful for a wide variety of applications such as virtual reality, gaming, and animation. Modern generative models learn from large-scale datasets and generate new samples following similar distributions. However, when training data is limited, deep neural generative networks overfit and tend to replicate training samples. Prior works focus on few-shot image generation to produce high-quality and diverse results using a few target images. Unfortunately, abundant 3D shape data is typically hard to obtain as well. In this work, we make the first attempt to realize few-shot 3D shape adaptation by adapting generative models pre-trained on large source domains to target domains. To relieve overfitting and keep considerable diversity, we propose to maintain the probability distributions of the pairwise relative distances between adapted samples at feature-level and shape-level during domain adaptation. Our approach only needs the silhouettes of few-shot target samples as training data to learn target geometry distributions and achieve generated shapes with diverse topology and textures. Moreover, we introduce several metrics to evaluate generation quality and diversity. The effectiveness of our approach is demonstrated qualitatively and quantitatively under a series of few-shot 3D shape adaptation setups.

URL: https://openreview.net/forum?id=WhsTr9IPAX

---

Title: Conservative Evaluation of Offline Policy Learning

Abstract: The world offers unprecedented amounts of data in real-world domains, from which we can develop successful decision-making systems. It is possible for reinforcement learning (RL) to learn control policies offline from such data but challenging to deploy an agent during learning in safety-critical domains. Offline RL learns from historical data without access to an environment. Therefore, we need a methodology for estimating how a newly-learned agent will perform when deployed in the real environment \emph{before} actually deploying it. To achieve this, we propose a framework for conservative evaluation of offline policy learning (CEOPL). We focus on being conservative so that the probability that our agent performs below a baseline is approximately $\delta$, where $\delta$ specifies how much risk we are willing to accept. In our setting, we assume access to a data stream, split into a train-set to learn an offline policy, and a test-set to estimate a lower-bound on the offline policy using off-policy evaluation with bootstrap confidence intervals. A lower-bound estimate allows us to decide when to deploy our learned policy with minimal risk of overestimation. We demonstrate CEOPL on a range of tasks as well as real-world medical data.
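
A minimal sketch of the lower-bounding step described above follows, using a percentile bootstrap over per-episode off-policy value estimates. The estimator, function name, and toy data are assumptions; the paper's exact procedure may differ.

    import numpy as np

    def bootstrap_lower_bound(per_episode_estimates, delta=0.05, n_boot=5000, seed=0):
        """Percentile-bootstrap lower bound on the mean of per-episode off-policy
        value estimates (e.g. importance-weighted returns from the test split)."""
        rng = np.random.default_rng(seed)
        est = np.asarray(per_episode_estimates)
        boot_means = np.array([rng.choice(est, size=est.size, replace=True).mean()
                               for _ in range(n_boot)])
        return np.quantile(boot_means, delta)

    rng = np.random.default_rng(1)
    is_returns = rng.gamma(shape=2.0, scale=0.5, size=300)   # toy importance-weighted returns
    baseline_value = 0.9
    lb = bootstrap_lower_bound(is_returns, delta=0.05)
    # Deploy the learned policy only if the lower bound beats the baseline.
    print(f"lower bound {lb:.3f}; deploy: {lb > baseline_value}")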

URL: https://openreview.net/forum?id=kLo4TKh0OP

---

Title: Finding Adversarially Robust Graph Lottery Tickets

Abstract: Graph Lottery Tickets (GLTs), comprising a sparse graph neural network (GNN) and a sparse input graph adjacency matrix, can significantly reduce the inference compute footprint compared to their dense counterparts. However, their performance against adversarial attacks remains to be fully explored. In this paper, we first investigate the resilience of GLTs against different structure perturbation attacks and observe that they are vulnerable and show a large drop in classification accuracy. We then present an \emph{adversarially robust graph sparsification (ARGS)} framework that prunes the adjacency matrix and the GNN weights by optimizing a novel loss function capturing the graph homophily property and information associated with both the true labels of the train nodes and the pseudo labels of the test nodes. By iteratively applying ARGS to prune both the perturbed graph adjacency matrix and the GNN model weights, we can find adversarially robust graph lottery tickets that are highly sparse yet achieve competitive performance under different training-time structure attacks. Evaluations conducted on various benchmarks, considering different poisoning structure attacks such as PGD, MetaAttack, PR-BCD, GR-BCD, and adaptive attacks, demonstrate that the GLTs generated by ARGS can significantly improve their robustness, even when subjected to high levels of sparsity.

URL: https://openreview.net/forum?id=PX06pUVs1P

---

Title: G-TRACER: Expected Sharpness Optimization

Abstract: We propose a new regularization scheme for the optimization of deep learning architectures, G-TRACER ("Geometric TRACE Ratio"), which promotes generalization by seeking minima with low mean curvature, and which has a sound theoretical basis as an approximation to a natural gradient-descent based optimization of a generalized Bayes objective. By augmenting the loss function with a G-TRACER penalty, which can be interpreted as the metric trace of the Hessian (the Laplace-Beltrami operator) with respect to the Fisher information metric, curvature-regularized optimizers (e.g. SGD-TRACER and Adam-TRACER) are simple to implement as modifications to existing optimizers and do not require extensive tuning. We show that the method can be interpreted as penalizing, in the neighborhood of a minimum, the difference between the mean value of the loss and the value at the minimum, in a way that adjusts for the natural geometry of the parameter space induced by the KL divergence. We show that the method converges to a neighborhood (depending on the regularization strength) of a local minimum of the unregularized objective, and demonstrate promising performance on a number of benchmark computer vision and NLP datasets, with a particular focus on challenging problems characterized by a low signal-to-noise ratio, or an absence of natural data augmentations and other regularization schemes.

URL: https://openreview.net/forum?id=aPUTYZ7GiS

---

Title: Scaling Up Bayesian Neural Networks with Neural Networks

Abstract: Bayesian Neural Networks (BNNs) offer a principled and natural framework for proper uncertainty quantification in the context of deep learning. They address the typical challenges associated with conventional deep learning methods, such as data insatiability, ad-hoc nature, and susceptibility to overfitting. However, their implementation typically relies either on Markov chain Monte Carlo (MCMC) methods, which are characterized by their computational intensity and inefficiency in high-dimensional spaces, or on variational inference methods, which tend to underestimate uncertainty. To address this issue, we propose a novel Calibration-Emulation-Sampling (CES) strategy to significantly enhance the computational efficiency of BNNs. In this framework, during the initial calibration stage, we collect a small set of samples from the parameter space. These samples serve as training data for the emulator, which approximates the map between parameters and posterior probability. The trained emulator is then used for sampling from the posterior distribution at substantially higher speed compared to standard BNN inference. Using simulated and real data, we demonstrate that our proposed method improves the computational efficiency of BNNs, while maintaining similar performance in terms of prediction accuracy and uncertainty quantification.

URL: https://openreview.net/forum?id=cD209UgOX7

---

Title: PAC Privacy Preserving Diffusion Models

Abstract: Data privacy protection is garnering increased attention among researchers. Diffusion models (DMs), particularly with strict differential privacy, can potentially produce images with both high privacy and visual quality. However, challenges arise, such as ensuring robust protection when privatizing specific data attributes, an area where current models often fall short. To address these challenges, we introduce the PAC Privacy Preserving Diffusion Model, a model that leverages diffusion principles and ensures Probably Approximately Correct (PAC) privacy. We enhance privacy protection by integrating a private classifier guidance into the Langevin Sampling Process. Additionally, recognizing the gap in measuring the privacy of models, we have developed a novel metric to gauge privacy levels. Our model, assessed with this new metric and supported by Gaussian matrix computations for the PAC bound, has shown superior performance in privacy protection over existing leading private generative models according to benchmark tests.

URL: https://openreview.net/forum?id=jjQTE2ayrX

---

Title: A Semi-Bayesian Nonparametric Estimator of the Maximum Mean Discrepancy Measure: Applications in Goodness-of-Fit Testing and Generative Adversarial Networks

Abstract: A classic inferential statistical problem is the goodness-of-fit (GOF) test. Performing such tests can be challenging when the hypothesized parametric model has an intractable likelihood and its distributional form is not available. Bayesian methods for GOF testing can be appealing due to their ability to incorporate expert knowledge through prior distributions. However, standard Bayesian methods for this test often require strong distributional assumptions on the data and their relevant parameters. To address this issue, we propose a semi-Bayesian nonparametric (semi-BNP) procedure based on the maximum mean discrepancy (MMD) measure that can be applied to the GOF test. We introduce a novel Bayesian estimator for the MMD, which enables the development of a measure-based hypothesis test for intractable models. Through extensive experiments, we demonstrate that our proposed test outperforms frequentist MMD-based methods by achieving a lower false rejection and acceptance rate of the null hypothesis. Furthermore, we showcase the versatility of our approach by embedding the proposed estimator within a generative adversarial network (GAN) framework. It facilitates a robust BNP learning approach as another significant application of our method. With our BNP procedure, this new GAN approach can enhance sample diversity and improve inferential accuracy compared to traditional techniques.
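
For reference, the standard unbiased estimator of squared MMD with an RBF kernel, against which such Bayesian estimators are typically compared, looks as follows. The kernel bandwidth and toy data are assumptions; the paper's semi-Bayesian nonparametric estimator itself is not shown.

    import numpy as np

    def mmd2_unbiased(x, y, bandwidth=1.0):
        """Unbiased estimate of squared MMD between samples x and y (RBF kernel)."""
        def k(a, b):
            d2 = np.sum(a ** 2, 1)[:, None] + np.sum(b ** 2, 1)[None, :] - 2 * a @ b.T
            return np.exp(-d2 / (2 * bandwidth ** 2))
        m, n = len(x), len(y)
        kxx, kyy, kxy = k(x, x), k(y, y), k(x, y)
        term_x = (kxx.sum() - np.trace(kxx)) / (m * (m - 1))
        term_y = (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
        return term_x + term_y - 2 * kxy.mean()

    rng = np.random.default_rng(0)
    x = rng.normal(0.0, 1.0, size=(200, 2))
    y = rng.normal(0.5, 1.0, size=(200, 2))
    print(mmd2_unbiased(x, y))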

URL: https://openreview.net/forum?id=lUnlHS1FYT

---

Title: Gromov-Wasserstein-like Distances in the Gaussian Mixture Models Space

Abstract: The Gromov-Wasserstein (GW) distance is frequently used in machine learning to compare distributions across distinct metric spaces. Despite its utility, it remains computationally intensive, especially for large-scale problems. Recently, a novel Wasserstein distance specifically tailored for Gaussian mixture models, known as $MW_2$ (mixture Wasserstein), has been introduced by several authors. In scenarios where data exhibit clustering, this approach simplifies to a small-scale discrete optimal transport problem, whose complexity depends solely on the number of Gaussian components in the GMMs. This paper aims to extend $MW_2$ by introducing new Gromov-type distances. These distances are designed to be isometry-invariant in Euclidean spaces and are applicable for comparing GMMs across spaces of different dimensions. Our first contribution is the Mixture Gromov-Wasserstein distance ($MGW_2$), which can be viewed as a 'Gromovized' version of $MW_2$. This new distance has a straightforward discrete formulation, making it highly efficient for estimating distances between GMMs in practical applications. To facilitate the derivation of a transport plan between GMMs, we present a second distance, the Embedded Wasserstein distance ($EW_2$). This distance turns out to be closely related to several recent alternatives to Gromov-Wasserstein. We show that it can be adapted to derive both a distance and optimal transportation plans between GMMs. We demonstrate the efficiency of these newly proposed distances on medium to large-scale problems, including shape matching and hyperspectral image color transfer.

URL: https://openreview.net/forum?id=7t7fJT4Gym

---

Title: Gaussian Process Spatial Clustering

Abstract: Spatial clustering is a common unsupervised learning problem with many applications in areas such as public health, urban planning or transportation, where the goal is to identify clusters of similar locations based on regionalization as well as patterns in characteristics over those locations. Unlike standard clustering, a well-studied area with a rich literature including methods such as K-means clustering, spectral clustering, and hierarchical clustering, spatial clustering is a relatively sparse area of study due to inherent differences between the spatial domain of the data and its corresponding covariates. In the case of our motivating example, the American Community Survey dataset, spatial differences in census tract regions cannot be directly compared to differences in participant survey responses to indicators such as employment status or income. As such, in this paper, we develop a spatial clustering algorithm called Gaussian Process Spatial Clustering (GPSC), which clusters functions between data leveraging the flexibility of Gaussian processes and extends it to the case of clustering geospatial data. We provide theoretical guarantees and demonstrate its capabilities to recover true clusters in several simulation studies and a real-world dataset to identify clusters of tracts in North Carolina based on socioeconomic and environmental indicators associated with health and cancer risk.

URL: https://openreview.net/forum?id=qe0hLlEAqg

---

Title: Double Descent and Other Interpolation Phenomena in GANs

Abstract: We study overparameterization in generative adversarial networks (GANs) that can interpolate the training data. We show that overparameterization can improve generalization performance and accelerate the training process. We study the generalization error as a function of latent space dimension and identify two main behaviors, depending on the learning setting. First, we show that overparameterized generative models that learn distributions by minimizing a metric or $f$-divergence do not exhibit double descent in generalization errors; specifically, all the interpolating solutions achieve the same generalization error. Second, we develop a novel pseudo-supervised learning approach for GANs where the training utilizes pairs of fabricated (noise) inputs in conjunction with real output samples. Our pseudo-supervised setting exhibits double descent (and in some cases, triple descent) of generalization errors. We combine pseudo-supervision with overparameterization (i.e., overly large latent space dimension) to accelerate training while matching or even surpassing generalization performance without pseudo-supervision. While our analysis focuses mostly on linear models, we also apply important insights for improving generalization of nonlinear, multilayer GANs.

URL: https://openreview.net/forum?id=kewMtmcfWv

---

Title: Continuous-time Particle Filtering for Latent Stochastic Differential Equations

Abstract: Particle filtering is a standard Monte-Carlo approach for a wide range of sequential inference tasks. The key component of a particle filter is a set of particles with importance weights that serve as a proxy of the true posterior distribution of some stochastic process. In this work, we propose continuous latent particle filters, an approach that extends particle filtering to the continuous-time domain of latent neural stochastic differential equations. We demonstrate how continuous latent particle filters can be used as a generic plug-in replacement for inference techniques relying on a learned variational posterior. Our experiments with different model families based on latent neural stochastic differential equations demonstrate superior performance of continuous-time particle filtering in inference tasks like likelihood estimation and sequential prediction for a variety of synthetic and real-world data.

URL: https://openreview.net/forum?id=7saHXoW1T1

---

Title: Multiset Transformer: Advancing Representation Learning in Persistence Diagrams

Abstract: To improve persistence diagram representation learning, we propose Multiset Transformer. This is the first neural network that utilizes attention mechanisms specifically designed for multisets as inputs and offers rigorous theoretical guarantees of permutation invariance. The architecture integrates multiset-enhanced attentions with a pool-decomposition scheme, allowing multiplicities to be preserved across equivariant layers. This capability enables full leverage of multiplicities while significantly reducing both computational and spatial complexity compared to the Set Transformer. Additionally, our method can greatly benefit from clustering as a preprocessing step to further minimize complexity, an advantage not possessed by the Set Transformer. Experimental results demonstrate that the Multiset Transformer outperforms existing neural network methods in the realm of persistence diagram representation learning.

URL: https://openreview.net/forum?id=1KPfhL102s

---

Title: Towards General Purpose Vision Foundation Models for Medical Image Analysis: An Experimental Study of DINOv2 on Radiology Benchmarks

Abstract: The integration of deep learning systems into healthcare has been hindered by the resource-intensive process of data annotation and the inability of these systems to generalize to different data distributions. Foundation models, which are models pre-trained on large datasets, have emerged as a solution to reduce reliance on annotated data and enhance model generalizability and robustness. DINOv2 is an open-source foundation model pre-trained with self-supervised learning on 142 million curated natural images that exhibits promising capabilities across various vision tasks. Nevertheless, a critical question remains unanswered regarding DINOv2's adaptability to radiological imaging, and whether its features are sufficiently general to benefit radiology image analysis. Therefore, this study comprehensively evaluates the performance of DINOv2 for radiology, conducting over 200 experiments across diverse modalities (X-ray, CT, and MRI). To measure the effectiveness and generalizability of DINOv2's feature representations, we analyze the model across medical image analysis tasks including disease classification and organ segmentation on both 2D and 3D images, and under different settings like kNN, few-shot learning, linear-probing, end-to-end fine-tuning, and parameter-efficient fine-tuning. Comparative analyses with established supervised, self-supervised, and weakly-supervised models reveal DINOv2's superior performance and cross-task generalizability. The findings contribute insights to potential avenues for optimizing pre-training strategies for medical imaging and enhancing the broader understanding of DINOv2's role in bridging the gap between natural and radiological image analysis. Our code is available at \href{https://github.com/MohammedSB/DINOv2ForRadiology}{https://github.com/MohammedSB/DINOv2ForRadiology}

URL: https://openreview.net/forum?id=YnEYpYRvkL

---

Title: Attacking Bayes: On the Adversarial Robustness of Bayesian Neural Networks

Abstract: Adversarial examples have been shown to cause neural networks to fail on a wide range of vision and language tasks, but recent work has claimed that {\em Bayesian} neural networks (BNNs) are inherently robust to adversarial perturbations. In this work, we examine this claim. To study the adversarial robustness of BNNs, we investigate whether it is possible to successfully break state-of-the-art BNN inference methods and prediction pipelines using even relatively unsophisticated attacks for three tasks: (1) label prediction under the posterior predictive mean, (2) adversarial example detection with Bayesian predictive uncertainty, and (3) semantic shift detection. We find that BNNs trained with state-of-the-art approximate inference methods, and even BNNs trained with Hamiltonian Monte Carlo, are highly susceptible to adversarial attacks. We also identify various conceptual and experimental errors in previous works that claimed inherent adversarial robustness of BNNs and conclusively demonstrate that BNNs and uncertainty-aware Bayesian prediction pipelines are {\em not} inherently robust against adversarial attacks.

URL: https://openreview.net/forum?id=C6wj17VBnu

---

Title: Ensemble Value Functions for Efficient Exploration in Multi-Agent Reinforcement Learning

Abstract: Existing value-based algorithms for cooperative multi-agent reinforcement learning (MARL) commonly rely on random exploration, such as $\epsilon$-greedy, to explore the environment. However, such exploration is inefficient at finding effective joint actions in states that require cooperation of multiple agents. In this work, we propose ensemble value functions for multi-agent exploration (EMAX), a general framework to seamlessly extend value-based MARL algorithms with ensembles of value functions. EMAX leverages the ensemble of value functions to guide the exploration of agents, stabilise their optimisation, and make their policies more robust to miscoordination. These benefits are achieved by using a combination of three techniques. (1) EMAX uses the uncertainty of value estimates across the ensemble in a UCB policy to guide the exploration. This exploration policy focuses on parts of the environment which require cooperation across agents and, thus, enables agents to more efficiently learn how to cooperate. (2) During the optimisation, EMAX computes target values as average value estimates across the ensemble. These targets exhibit lower variance compared to commonly applied target networks, leading to significant benefits in MARL, which commonly suffers from high variance caused by the exploration and non-stationary policies of other agents. (3) During evaluation, EMAX selects actions following a majority vote across the ensemble, which reduces the likelihood of selecting sub-optimal actions. We instantiate three value-based MARL algorithms with EMAX: independent DQN, VDN and QMIX, and evaluate them in 21 tasks across four environments. Using ensembles of five value functions, EMAX improves sample efficiency and final evaluation returns of these algorithms by 60%, 47%, and 539%, respectively, averaged across 21 tasks.
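
A minimal sketch of the ensemble-based action selection described in points (1) and (3) follows; the UCB coefficient and toy Q-values are assumptions for illustration.

    import numpy as np

    def ucb_action(q_values_ensemble: np.ndarray, beta: float = 1.0) -> int:
        """Exploration: pick the action maximizing mean Q-value plus beta times
        the across-ensemble standard deviation (uncertainty bonus)."""
        mean = q_values_ensemble.mean(axis=0)
        std = q_values_ensemble.std(axis=0)
        return int(np.argmax(mean + beta * std))

    def majority_vote_action(q_values_ensemble: np.ndarray) -> int:
        """Evaluation: each ensemble member votes for its own greedy action."""
        votes = q_values_ensemble.argmax(axis=1)
        return int(np.bincount(votes).argmax())

    q_ens = np.array([[1.0, 0.9, 0.2],      # rows: ensemble members, cols: actions
                      [0.8, 1.1, 0.3],
                      [0.9, 1.0, 0.1],
                      [1.1, 0.7, 0.2],
                      [0.9, 1.2, 0.4]])
    print(ucb_action(q_ens, beta=1.0), majority_vote_action(q_ens))

Target values during training would analogously use q_ens.mean(axis=0), as described in point (2).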

URL: https://openreview.net/forum?id=YSNBwMPrXm

---

Title: The Cold Posterior Effect Indicates Underfitting, and Cold Posteriors Represent a Fully Bayesian Method to Mitigate It

Abstract: The cold posterior effect (CPE) (Wenzel et al., 2020) in Bayesian deep learning shows that, for posteriors with a temperature $T<1$, the resulting posterior predictive could have better performance than the Bayesian posterior ($T=1$). As the Bayesian posterior is known to be optimal under perfect model specification, many recent works have studied the presence of CPE as a model misspecification problem, arising from the prior and/or from the likelihood. In this work, we provide a more nuanced understanding of the CPE as we show that \emph{misspecification leads to CPE only when the resulting Bayesian posterior underfits}. In fact, we theoretically show that if there is no underfitting, there is no CPE. Furthermore, we show that these \emph{tempered posteriors} (with $T < 1$) are indeed proper Bayesian posteriors with a different combination of likelihood and prior parameterized by $T$. Within the \textit{empirical Bayes} framework, this observation validates the adjustment of the temperature hyperparameter $T$ as a straightforward approach to mitigate underfitting in the Bayesian posterior. In essence, we show that by fine-tuning the temperature $T$ we implicitly utilize alternative Bayesian posteriors, albeit with less misspecified likelihood and prior distributions.
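
The identity behind this reading can be written compactly (assuming the tempered prior $p(\theta)^{1/T}$ remains normalizable): $$\pi_T(\theta \mid D) \;\propto\; \big(p(D \mid \theta)\, p(\theta)\big)^{1/T} \;=\; \underbrace{p(D \mid \theta)^{1/T}}_{\text{re-scaled likelihood}}\; \underbrace{p(\theta)^{1/T}}_{\propto\ \text{re-scaled prior}},$$ so a cold posterior with $T<1$ is an ordinary Bayesian posterior under this alternative likelihood-prior pair, which is the sense in which the abstract calls tempering a fully Bayesian method.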

URL: https://openreview.net/forum?id=GZORXGxHHT

---

Title: Random Projection Variational Auto-Encoders

Abstract: Variational Auto-Encoders optimise the parameters of a distribution that approximates the posterior distribution of some data. We focus on the case where the approximating distribution includes Gaussian distributions related to each datum. When these Gaussians are each defined on a high-dimensional space, it is often assumed that using full-rank covariance matrices would be prohibitively computationally expensive and prone to overfitting. In such settings, a parameterisation that constrains each covariance matrix to be diagonal is often adopted. We propose the use of approximations that offer the potential for alternative compromises between the computational expense, overfitting and accuracy of full-rank and diagonal covariances. More specifically, we propose using covariance matrices that involve a random projection of a full-rank covariance defined in a low-dimensional space. In this ablation study, we isolate the varying parameterisation from other techniques and assess the impact of the dimensionality of this low-dimensional space on both computational cost and accuracy in the context of MNIST, CIFAR-10 and Flowers-102. We observe that, for a finite number of training iterations, accuracy is maximised by a compromise that is equivalent to neither the full-rank nor the diagonal covariance. We also find that the computational cost fluctuates less than one might anticipate and that performance is improved by the parameterisation based on a random projection of a lower-dimensional full-rank covariance.

URL: https://openreview.net/forum?id=ix5SQjwyrv

---

Title: Non-Cross Diffusion for Semantic Consistency

Abstract: In diffusion models, deviations from a straight generative flow are a common issue, resulting in semantic inconsistencies and suboptimal generations. To address this challenge, we introduce Non-Cross Diffusion, an innovative approach in generative modeling for learning ordinary differential equation (ODE) models. Our methodology strategically incorporates an ascending dimension of input to effectively connect points sampled from two distributions with uncrossed paths. This design ensures enhanced semantic consistency throughout the inference process, which is especially critical for applications reliant on consistent generative flows, including distillation methods and deterministic sampling, which are fundamental in image editing and interpolation tasks.

Our empirical results demonstrate the effectiveness of Non-Cross Diffusion, showing a substantial reduction in semantic inconsistencies at different inference steps and a notable enhancement in the overall performance of diffusion models.

URL: https://openreview.net/forum?id=7eYwoELDg7

---

Title: Multi-Grid Tensorized Fourier Neural Operator for High-Resolution PDEs

Abstract: Memory complexity and data scarcity have so far prohibited learning solution operators of partial differential equations (PDEs) at high resolutions. We address these limitations by introducing a new data-efficient and highly parallelizable operator learning approach with reduced memory requirements and better generalization, called the multi-grid tensorized neural operator (MG-TFNO). MG-TFNO scales to large resolutions by leveraging local and global structures of full-scale, real-world phenomena, through a decomposition of both the input domain and the operator’s parameter space. Our contributions are threefold: i) we enable parallelization over input samples with a novel multi-grid-based domain decomposition, ii) we represent the parameters of the model in a high-order latent subspace of the Fourier domain, through a global tensor factorization, resulting in an extreme reduction in the number of parameters and improved generalization, and iii) we propose architectural improvements to the backbone FNO. Our approach can be used in any operator learning setting. We demonstrate superior performance on the turbulent Navier-Stokes equations where we achieve less than half the error with over 150× compression. The tensorization, combined with the domain decomposition, yields over 150× reduction in the number of parameters and 7× reduction in the domain size without losses in accuracy, while enabling parallelism.

URL: https://openreview.net/forum?id=AWiDlO63bH

---

Title: Towards context and domain-aware algorithms for scene analysis

Abstract: Interpersonal interactions and social situations in multimedia content encompass a rich blend of visual, textual, audio, and contextual cues. However, the integration of contextual data in multimodal scene analysis research has often been overlooked, leading to incomplete interpretations. For instance, recognizing that two combatants in a video are positioned within a designated ring with a dedicated referee drastically alters the perception from a simple scuffle to a structured martial arts contest.

This paper presents an innovative approach to scene analysis in video content, which not only incorporates contextual data but also emphasizes the most significant features during training. Additionally, we introduce a methodology for integrating domain knowledge into our framework. We evaluate our proposed methodology using two comprehensive datasets, demonstrating promising results compared to a baseline study using one of the datasets. These findings underscore the importance of integrating contextual data into multimodal video analysis, while also recognizing the challenges associated with their utilization.

URL: https://openreview.net/forum?id=JQGmbVK4Fr

---

Title: Membership Inference Attacks and Privacy in Topic Modeling

Abstract: Recent research shows that large language models are susceptible to privacy attacks that infer aspects of the training data. However, it is unclear if simpler generative models, like topic models, share similar vulnerabilities. In this work, we propose an attack against topic models that can confidently identify members of the training data in Latent Dirichlet Allocation. Our results suggest that the privacy risks associated with generative modeling are not restricted to large neural models. Additionally, to mitigate these vulnerabilities, we explore differentially private (DP) topic modeling. We propose a framework for private topic modeling that incorporates DP vocabulary selection as a pre-processing step, and show that it improves privacy while having limited effects on practical utility.

URL: https://openreview.net/forum?id=NmWp5lFL7L

---

Title: Coordinate Transform Fourier Neural Operators for Symmetries in Physical Modelings

Abstract: Symmetries often arise in the natural sciences; rather than relying on data augmentation or regularization to learn these symmetries, incorporating them directly into the neural network architecture simplifies the learning process and enhances model performance. The laws of physics, including partial differential equations (PDEs), remain unchanged regardless of the coordinate system employed to depict them, and symmetries can sometimes be expressed more naturally in other coordinate systems. Moreover, symmetries are often associated with the underlying domain shapes. In this work, we consider physical modelings with neural operators (NOs), and we propose an approach based on coordinate transforms (CT) to work on different domain shapes and symmetries. The resulting CT-FNO scheme barely increases computational complexity and generalizes well across different domain shapes while respecting the symmetries.

URL: https://openreview.net/forum?id=pMD7A77k3i

---

Title: Benchmarking Offline Reinforcement Learning in Factorisable Action Spaces

Abstract: Extending reinforcement learning (RL) to offline contexts is a promising prospect, particularly in sectors where data collection poses substantial challenges or risks. Pivotal to the success of transferring RL offline is mitigating overestimation bias in value estimates for state-action pairs absent from data. Whilst numerous approaches have been proposed in recent years, these tend to focus primarily on continuous or small-scale discrete action spaces. Factorised discrete action spaces, on the other hand, have received relatively little attention, despite many real-world problems naturally having factorisable actions. In this work, we undertake an initial formative investigation into offline reinforcement learning in factorisable action spaces. Using value-decomposition as formulated in DecQN as a foundation, we present the case for a factorised approach from both a theoretical and practical perspective, and conduct an extensive empirical evaluation of several offline techniques adapted to the factorised setting. In the absence of established benchmarks, we introduce a suite of our own based on a discretised variant of the DeepMind Control Suite, comprising datasets of varying quality and task complexity. Advocating for reproducible research and innovation, we make all datasets available for public use, alongside our code base.
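
A minimal sketch of the DecQN-style value decomposition the paper builds on, under our own assumptions about shapes and naming: Q(s, a) for a factorised action a = (a_1, ..., a_N) is the mean of per-dimension utilities, so each head only enumerates its own small sub-action space.

    # Minimal sketch (assumed details, not the paper's code): DecQN-style value decomposition
    # for factorised discrete action spaces.
    import torch
    import torch.nn as nn

    class DecQN(nn.Module):
        def __init__(self, obs_dim, sub_action_sizes):
            super().__init__()
            self.trunk = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU())
            self.heads = nn.ModuleList(nn.Linear(128, n) for n in sub_action_sizes)

        def forward(self, obs):
            h = self.trunk(obs)
            return [head(h) for head in self.heads]   # one utility vector per action dimension

        def q_value(self, obs, actions):              # actions: (batch, n_dims) integer indices
            utils = self.forward(obs)
            per_dim = [u.gather(1, actions[:, i : i + 1]) for i, u in enumerate(utils)]
            return torch.cat(per_dim, dim=1).mean(dim=1)   # Q(s, a) = mean of per-dim utilities

    net = DecQN(obs_dim=10, sub_action_sizes=[3, 3, 5])
    obs = torch.randn(4, 10)
    acts = torch.stack([torch.randint(0, 3, (4,)), torch.randint(0, 3, (4,)), torch.randint(0, 5, (4,))], dim=1)
    print(net.q_value(obs, acts).shape)               # torch.Size([4])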

URL: https://openreview.net/forum?id=fm679EfNqc

---

Title: Pixel-wise Agricultural Image Time Series Classification: Comparisons and a Deformable Prototype-based Approach

Abstract: Improvements in Earth observation by satellites allow for imagery of ever higher temporal and spatial resolution. Leveraging this data for agricultural monitoring is key for addressing environmental and economic challenges. Current methods for crop segmentation using temporal data either rely on annotated data or are heavily engineered to compensate for the lack of supervision. In this paper, we present and compare datasets and methods for both supervised and unsupervised pixel-wise segmentation of satellite image time series (SITS). We also introduce an approach to add invariance to spectral deformations and temporal shifts to classical prototype-based methods such as K-means and the Nearest Centroid Classifier (NCC). We study different levels of supervision and show that this simple and highly interpretable method achieves the best performance in the low-data regime and significantly improves the state of the art for unsupervised classification of agricultural time series on four recent SITS datasets.

URL: https://openreview.net/forum?id=hSFsiTTxZr

---

Title: Improving Shift Invariance in Convolutional Neural Networks with Translation Invariant Polyphase Sampling

Abstract: Downsampling operators break the shift invariance of convolutional neural networks (CNNs), and this affects the robustness of features learned by CNNs even under small pixel-level shifts. Through a large-scale correlation analysis framework, we study shift invariance of CNNs by inspecting existing downsampling operators in terms of their maximum-sampling bias (MSB), and find that MSB is negatively correlated with shift invariance. Based on this crucial insight, we propose a learnable pooling operator called Translation Invariant Polyphase Sampling (TIPS) and two regularizations on the intermediate feature maps of TIPS to reduce MSB and learn translation-invariant representations. TIPS can be integrated into any CNN and can be trained end-to-end with marginal computational overhead. Our experiments demonstrate that TIPS results in consistent performance gains in terms of accuracy, shift consistency, and shift fidelity on multiple benchmarks for image classification and semantic segmentation compared to previous methods, and also leads to improvements in adversarial and distributional robustness. TIPS results in the lowest MSB compared to all previous methods, thus explaining our strong empirical results.
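
A rough sketch of a generic learnable polyphase pooling layer (our simplification, not the exact TIPS operator or its regularizers): the stride-2 feature map is split into its four polyphase components, which are combined with learned softmax weights rather than selected purely by maximum activation.

    # Minimal sketch (a generic learnable polyphase pooling, not the exact TIPS operator).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LearnablePolyphasePool(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.scorer = nn.Conv2d(channels, 1, kernel_size=1)   # one score map per phase

        def forward(self, x):                                      # x: (B, C, H, W), H and W even
            phases = torch.stack(
                [x[:, :, i::2, j::2] for i in (0, 1) for j in (0, 1)], dim=1
            )                                                      # (B, 4, C, H/2, W/2)
            b, p, c, h, w = phases.shape
            scores = self.scorer(phases.reshape(b * p, c, h, w)).mean(dim=(1, 2, 3))
            weights = F.softmax(scores.reshape(b, p), dim=1)       # soft phase selection per image
            return (weights[:, :, None, None, None] * phases).sum(dim=1)

    pool = LearnablePolyphasePool(channels=16)
    print(pool(torch.randn(2, 16, 32, 32)).shape)                  # torch.Size([2, 16, 16, 16])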

URL: https://openreview.net/forum?id=c6xaAYoddZ

---

Title: Graph Reinforcement Learning for Combinatorial Optimization: A Survey and Unifying Perspective

Abstract: Graphs are a natural representation for systems based on relations between connected entities. Combinatorial optimization problems, which arise when considering an objective function related to a process of interest on discrete structures, are often challenging due to the rapid growth of the solution space. The trial-and-error paradigm of Reinforcement Learning has recently emerged as a promising alternative to traditional methods, such as exact algorithms and (meta)heuristics, for discovering better decision-making strategies in a variety of disciplines including chemistry, computer science, and statistics. Despite the fact that they arose in markedly different fields, these techniques share significant commonalities. Therefore, we set out to synthesize this work in a unifying perspective that we term Graph Reinforcement Learning, interpreting it as a constructive decision-making method for graph problems. After covering the relevant technical background, we review works along the dividing line of whether the goal is to optimize graph structure given a process of interest, or to optimize the outcome of the process itself under fixed graph structure. Finally, we discuss the common challenges facing the field and open research questions. In contrast with other surveys, the present work focuses on non-canonical graph problems for which performant algorithms are typically not known and Reinforcement Learning is able to provide efficient and effective solutions.

URL: https://openreview.net/forum?id=HduK51xNtS

---

Title: Hyperparameter Selection in Continual Learning

Abstract: In continual learning (CL)—where a learner trains on a stream of data—standard hyperparameter optimisation (HPO) cannot be applied, as a learner does not have access to all of the data at the same time. This has prompted the development of CL-specific HPO frameworks. The most popular way to tune hyperparameters in CL is to repeatedly train over the whole data stream with different hyperparameter settings. However, this *end-of-training* HPO is unrealistic as in practice a learner can only see the stream once. Hence, there is an open question: *what HPO framework should a practitioner use for a CL problem in reality?* This paper answers this question by evaluating several realistic HPO frameworks. We find that all the HPO frameworks considered, including end-of-training HPO, perform similarly. We therefore advocate using the realistic and most computationally efficient method: fitting the hyperparameters on the first task and then fixing them throughout training.
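
A minimal sketch of the advocated protocol under a hypothetical setup (scikit-learn's SGDClassifier standing in for the continual learner, synthetic tasks standing in for the stream): tune the hyperparameter on the first task only, then fix it for the rest of the stream.

    # Minimal sketch (hypothetical setup, not the paper's code): first-task HPO, then fixed
    # hyperparameters for the remainder of the data stream.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import train_test_split

    tasks = [make_classification(n_samples=600, n_features=20, random_state=s) for s in range(5)]

    # 1) Hyperparameter search on the first task only.
    X0, y0 = tasks[0]
    X0_tr, X0_val, y0_tr, y0_val = train_test_split(X0, y0, random_state=0)
    best_alpha, best_acc = None, -1.0
    for alpha in (1e-5, 1e-4, 1e-3, 1e-2):
        clf = SGDClassifier(alpha=alpha, random_state=0).fit(X0_tr, y0_tr)
        acc = clf.score(X0_val, y0_val)
        if acc > best_acc:
            best_alpha, best_acc = alpha, acc

    # 2) Fix the chosen hyperparameter and train continually over the stream.
    learner = SGDClassifier(alpha=best_alpha, random_state=0)
    for X, y in tasks:
        learner.partial_fit(X, y, classes=np.unique(y0))
    print("chosen alpha:", best_alpha, "final-task accuracy:", learner.score(*tasks[-1]))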

URL: https://openreview.net/forum?id=IWu0Rnr09e

---

Title: Exploiting Edge Features in Graph-based Learning with Fused Network Gromov-Wasserstein Distance

Abstract: Pairwise comparison of graphs is key to many applications in Machine Learning, ranging from clustering and kernel-based classification/regression to, more recently, supervised graph prediction. Distances between graphs usually rely on informative representations of these structured objects, such as bags of substructures or other graph embeddings. A recently popular solution consists in representing graphs as metric measure spaces, which makes it possible to leverage Optimal Transport and its meaningful distances for comparing them, namely the Gromov-Wasserstein distance and its variant the fused Gromov-Wasserstein distance, which applies to node-attributed graphs. However, this family of distances overlooks edge attributes, which are essential for many structured objects. In this work, we introduce an extension of the fused Gromov-Wasserstein distance for comparing graphs in which both nodes and edges have features. We propose novel algorithms for distance and barycenter computation. We present a range of studies that illustrate the properties of the proposed distance and empirically demonstrate its effectiveness in supervised graph prediction tasks.

URL: https://openreview.net/forum?id=8uCNtJ2Fmo

---

Title: Guarantees of confidentiality via Hammersley-Chapman-Robbins bounds

Abstract: Protecting privacy during inference with deep neural networks is possible by adding noise to the activations in the last layers prior to the final classifiers or other task-specific layers. The activations in such layers are known as "features" (or, less commonly, as "embeddings" or "feature embeddings"). The added noise helps prevent reconstruction of the inputs from the noisy features. Lower bounding the variance of every possible unbiased estimator of the inputs quantifies the confidentiality arising from such added noise. Convenient, computationally tractable bounds are available from classic inequalities of Hammersley and of Chapman and Robbins -- the HCR bounds. Numerical experiments indicate that the HCR bounds are on the precipice of being effectual for small neural nets with the data sets, "MNIST" and "CIFAR-10," which contain 10 classes each for image classification. The HCR bounds appear to be insufficient on their own to guarantee confidentiality of the inputs to inference with standard deep neural nets, "ResNet-18" and "Swin-T," pre-trained on the data set, "ImageNet-1000," which contains 1000 classes. Supplementing the addition of noise to features with other methods for providing confidentiality may be warranted in the case of ImageNet. In all cases, the results reported here limit consideration to amounts of added noise that incur little degradation in the accuracy of classification from the noisy features. Thus, the added noise enhances confidentiality without much reduction in the accuracy on the task of image classification.

URL: https://openreview.net/forum?id=DOWSP7y2cu

---

Title: Improved rate for Locally Differentially Private Linear Bandits

Abstract: In this paper, we propose a stochastic linear contextual bandit algorithm that ensures local differential privacy (LDP). Our algorithm is $(\epsilon,\delta)$-locally differentially private and guarantees $\tilde O\left(\sqrt{d}T^{3/4}\right)$ regret with high probability. This is a factor of $d^{1/4}$ improvement over the previous state of the art (SOTA) (Zheng et al., 2020). Furthermore, our regret guarantee improves to $\tilde O\left(\sqrt{dT}\right)$ when the action space is well-conditioned. This rate matches the optimal non-private asymptotic rate, thus demonstrating that we can achieve privacy for free even in the stringent LDP model. Our algorithm is the first to achieve $\tilde O(\sqrt{T})$ regret in a privacy setting that is stronger than the central setting.

URL: https://openreview.net/forum?id=sb5JTwoLmj

---

Title: Federated Learning Under Second-Order Data Heterogeneity

Abstract: We consider the problem of Federated Learning over clients with heterogeneous data. We propose an algorithm called SABER that samples a subset of clients and tasks each client with its own local subproblem. SABER provably reduces client drift by incorporating an estimate of the global update direction and regularization into each client's subproblem. Under second-order data heterogeneity with parameter $\delta$, we prove that the method's communication complexity for non-convex problems is $\mathcal{O}\left(\delta\varepsilon^{-2}\sqrt{M}\right)$. In addition, for problems satisfying the $\mu$-Polyak-Łojasiewicz condition, the method converges linearly with communication complexity of $\mathcal{O}\left(\left(\frac{\delta}{\mu}\sqrt{M} + M\right)\log\frac{1}{\varepsilon}\right)$. To showcase the empirical performance of our method, we compare it to standard baselines including FedAvg, FedProx, and SCAFFOLD on image classification problems and demonstrate its superior performance in data-heterogeneous settings.

URL: https://openreview.net/forum?id=71YWF6ZuPD

---

Title: DFML: Decentralized Federated Mutual Learning

Abstract: In the realm of real-world devices, centralized servers in Federated Learning (FL) present challenges including communication bottlenecks and susceptibility to a single point of failure. Additionally, contemporary devices inherently exhibit model and data heterogeneity. Existing work lacks a Decentralized FL (DFL) framework capable of accommodating such heterogeneity without imposing architectural restrictions or assuming the availability of additional data. To address these issues, we propose a Decentralized Federated Mutual Learning (DFML) framework that is serverless, supports nonrestrictive heterogeneous models, and avoids reliance on additional data. DFML effectively handles model and data heterogeneity through mutual learning, which distills knowledge between clients, and cyclically varying the amount of supervision and distillation signals. Extensive experimental results demonstrate consistent effectiveness of DFML in both convergence speed and global accuracy, outperforming prevalent baselines under various conditions. For example, with the CIFAR-100 dataset and 50 clients, DFML achieves a substantial increase of +17.20% and +19.95% in global accuracy under Independent and Identically Distributed (IID) and non-IID data shifts, respectively.

URL: https://openreview.net/forum?id=I9HvzJbUbh

---

Title: NuTime: Numerically Multi-Scaled Embedding for Large-Scale Time-Series Pretraining

Abstract: Recent research on time-series self-supervised models shows great promise in learning semantic representations. However, it has been limited to small-scale datasets, e.g., thousands of temporal sequences. In this work, we make key technical contributions that are tailored to the numerical properties of time-series data and allow the model to scale to large datasets, e.g., millions of temporal sequences. We adopt the Transformer architecture by first partitioning the input into non-overlapping windows. Each window is then characterized by its normalized shape and two scalar values denoting the mean and standard deviation within the window. To embed scalar values that may possess arbitrary numerical amplitudes into high-dimensional vectors, we propose a numerically multi-scaled embedding module enumerating all possible numerical scales for the scalars. The model undergoes pretraining with a simple contrastive objective on a large-scale dataset of over a million sequences collected by merging existing public data. We study its transfer performance on a number of univariate and multivariate classification tasks, few-shot learning, unsupervised clustering, and anomaly detection benchmarks. Our method exhibits remarkable improvement over previous pretraining approaches and establishes a new state of the art, even compared with domain-specific non-learning-based methods.
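
A rough sketch of the windowing and scalar-embedding idea under our own assumed details (the window size, scale set, and tanh squashing are illustrative choices, not the authors' exact module):

    # Minimal sketch (assumed details, not the authors' code): normalise each window's shape
    # and embed its mean and standard deviation with a multi-scaled scalar embedding.
    import torch
    import torch.nn as nn

    class MultiScaledScalarEmbed(nn.Module):
        def __init__(self, dim, scales=(1e-2, 1e-1, 1.0, 1e1, 1e2)):
            super().__init__()
            self.register_buffer("scales", torch.tensor(scales))
            self.proj = nn.Linear(len(scales), dim)

        def forward(self, s):                         # s: (..., 1) scalar values
            # Squash the scalar at every enumerated scale so arbitrary amplitudes stay bounded.
            return self.proj(torch.tanh(s / self.scales))

    def window_tokens(x, win=16, dim=64):
        b, t = x.shape
        windows = x.reshape(b, t // win, win)
        mean = windows.mean(dim=-1, keepdim=True)
        std = windows.std(dim=-1, keepdim=True) + 1e-6
        shape = (windows - mean) / std                # normalised window shape
        embed = MultiScaledScalarEmbed(dim)           # modules built inline just for illustration
        shape_tok = nn.Linear(win, dim)(shape)
        return shape_tok + embed(mean) + embed(std)   # one token per window

    print(window_tokens(torch.randn(4, 128)).shape)   # torch.Size([4, 8, 64])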

URL: https://openreview.net/forum?id=TwiSBZ0p9u

---

Title: Autoencoding Hyperbolic Representation for Adversarial Generation

Abstract: With the recent advance of geometric deep learning, neural networks have been extensively used for data in non-Euclidean domains. In particular, hyperbolic neural networks have proved successful in processing hierarchical information in data. However, many hyperbolic neural networks are numerically unstable during training, which precludes using complex architectures. This crucial problem makes it difficult to build hyperbolic generative models for real and complex data. In this work, we propose a hyperbolic generative network in which we design a novel architecture and layers to improve stability in training. Our proposed network contains three parts: first, a hyperbolic autoencoder (AE) that produces hyperbolic embeddings for input data; second, a hyperbolic generative adversarial network (GAN) for generating the hyperbolic latent embedding of the AE from simple noise; third, a generator that inherits the decoder from the AE and the generator from the GAN. Our architecture fosters expressive and numerically stable representations in the hyperbolic space. Theoretically, we validate the training of the GAN in hyperbolic space, and prove the stability of the hyperbolic layers used in the AE. Experiments show that our model is capable of generating tree-like graphs as well as complex molecular data with state-of-the-art structure-related performance.

URL: https://openreview.net/forum?id=NQi9U0YLW3

---

Title: A Simple Video Segmenter by Tracking Objects Along Axial Trajectories

Abstract: Video segmentation requires consistently segmenting and tracking objects over time. Due to the quadratic dependency on input size, directly applying self-attention to video segmentation with high-resolution input features poses significant challenges, often leading to GPU Out-Of-Memory errors. Consequently, modern video segmenters either extend an image segmenter without incorporating any temporal attention or resort to window space-time attention in a naive manner. In this work, we present Axial-VS, a general and simple framework that enhances video segmenters by tracking objects along axial trajectories. The framework tackles video segmentation through two sub-tasks: short-term within-clip segmentation and long-term cross-clip tracking. In the first step, Axial-VS augments an off-the-shelf clip-level video segmenter with the proposed axial-trajectory attention, sequentially tracking objects along the height- and width-trajectories within a clip, thereby enhancing temporal consistency by capturing motion trajectories. The axial decomposition significantly reduces the computational complexity for dense features, and outperforms the window space-time attention in segmentation quality. In the second step, we further employ axial-trajectory attention to the object queries in clip-level segmenters, which are learned to encode object information, thereby aiding object tracking across different clips and achieving consistent segmentation throughout the video. Without bells and whistles, Axial-VS showcases state-of-the-art results on video segmentation benchmarks, emphasizing its effectiveness in addressing the limitations of modern clip-level video segmenters. Code will be made available.

URL: https://openreview.net/forum?id=Sy6ZOStz5v

---

Title: S-TLLR: STDP-inspired Temporal Local Learning Rule for Spiking Neural Networks

Abstract: Spiking Neural Networks (SNNs) are biologically plausible models that have been identified as potentially apt for deploying energy-efficient intelligence at the edge, particularly for sequential learning tasks. However, training SNNs poses significant challenges due to the necessity for precise temporal and spatial credit assignment. The back-propagation through time (BPTT) algorithm, whilst the most widely used method for addressing these issues, incurs a high computational cost due to its temporal dependency. In this work, we propose S-TLLR, a novel three-factor temporal local learning rule inspired by the Spike-Timing Dependent Plasticity (STDP) mechanism, aimed at training deep SNNs on event-based learning tasks. Furthermore, S-TLLR is designed to have low memory and time complexity, independent of the number of time steps, rendering it suitable for online learning on low-power edge devices. To demonstrate the scalability of our proposed method, we have conducted extensive evaluations on event-based datasets spanning a wide range of applications, such as image and gesture recognition, audio classification, and optical flow estimation. In all experiments, S-TLLR achieved high accuracy, comparable to BPTT, with a reduction in memory of $5$-$50\times$ and in multiply-accumulate (MAC) operations of $1.3$-$6.6\times$.

URL: https://openreview.net/forum?id=CNaiJRcX84

---

Title: Meta-Sparsity: Learning Optimal Sparse Structures in Multi-task Networks through Meta-learning

Abstract: This paper presents meta-sparsity, a framework for learning model sparsity (that is, learning the parameter that controls the degree of sparsity) that allows deep neural networks (DNNs) to inherently generate optimal sparse shared structures in a multi-task learning (MTL) setting. The proposed approach enables the dynamic learning of sparsity patterns across a variety of tasks, unlike traditional sparsity methods that rely heavily on manual hyperparameter tuning. Inspired by Model Agnostic Meta-Learning (MAML), the emphasis is on learning shared and optimally sparse parameters in multi-task scenarios by implementing a penalty-based, channel-wise structured sparsity during the meta-training phase. This method improves the model's efficacy by removing unnecessary parameters and enhances its ability to handle both seen and previously unseen tasks. The effectiveness of meta-sparsity is rigorously evaluated by extensive experiments on two datasets, NYU-v2 and CelebAMask-HQ, covering a broad spectrum of tasks ranging from pixel-level to image-level predictions. The results show that the proposed approach performs well across many tasks, indicating its potential as a versatile tool for creating efficient and adaptable sparse neural networks. This work therefore presents an approach towards learning sparsity, contributing to the efforts in the field of sparse neural networks and suggesting new directions for research towards parsimonious models.

URL: https://openreview.net/forum?id=tT0gXgiPU5

---

Title: On the Data Heterogeneity in Adaptive Federated Learning

Abstract: Adaptive federated learning, which benefits from the characteristic of both adaptive optimizer and federated training paradigm, has recently gained lots of attention. Despite achieving outstanding performances on tasks with heavy-tail stochastic gradient noise distributions, adaptive federated learning also suffers from the same data heterogeneity issue as standard federated learning: heterogeneous data distribution across the clients can largely deteriorate the convergence of adaptive federated learning. In this paper, we propose a novel adaptive federated learning framework with local gossip averaging to address this issue. Particularly, we introduce a client re-sampling mechanism and peer-to-peer gossip communications between local clients to mitigate the data heterogeneity without requiring additional gradient computation costs. We theoretically prove the fast convergence for our proposed method under non-convex stochastic settings and empirically demonstrate its superior performances over vanilla adaptive federated learning with client sampling. Moreover, we extend our framework to a communication-efficient variant, in which clients are divided into disjoint clusters determined by their connectivity or communication capabilities. We exclusively perform local gossip averaging within these clusters, leading to an enhancement in network communication efficiency for our proposed method.

URL: https://openreview.net/forum?id=hv7iXsiBZE

---

Title: Striking a Balance: An Optimal Mechanism Design for Heterogenous Differentially Private Data Acquisition for Logistic Regression

Abstract: We investigate the problem of solving ML tasks from data collected from privacy-sensitive sellers. Since the data is private, sellers must be incentivized through payments to provide their data. Thus, the goal is to design a mechanism that optimizes a weighted combination of test loss, seller privacy, and payment, i.e., strikes a balance between getting a good privacy-preserving ML model and limiting payments to the sellers. To do this, we first solve logistic regression with known heterogeneous differential privacy guarantees. We then consider the main problem where the differential privacy requirements are decided by the buyer to balance the tradeoff between test loss and payments. To solve this problem, we use our earlier result on logistic regression with known privacy guarantees along with standard mechanism design theory to formulate an optimization problem, which is nonconvex. We establish conditions under which the problem can be convexified using a change-of-variables technique. This insight is then harnessed to develop an algorithm that provides an optimal solution. Additionally, we demonstrate the resilience of our mechanism to scenarios in which data points and privacy sensitivities are correlated. Finally, we demonstrate the utility of our algorithm by applying it to the Wisconsin breast cancer dataset.

URL: https://openreview.net/forum?id=9D4rvCnbqt

---

Title: Grid Cell-Inspired Fragmentation and Recall for Efficient Map Building

Abstract: Animals and robots navigate through environments by building and refining maps of space. These maps enable functions including navigation back home, planning, search, and foraging. Here, we use observations from neuroscience, specifically the observed fragmentation of grid cell maps in compartmentalized spaces, to propose and apply the concept of Fragmentation-and-Recall (FARMap) to the mapping of large spaces. Agents solve the mapping problem by building local maps via a surprisal-based clustering of space, which they use to set subgoals for spatial exploration. Agents build and use a local map to predict their observations; high surprisal leads to a "fragmentation event" that truncates the local map. At these events, the recent local map is placed into long-term memory (LTM) and a different local map is initialized. If observations at a fracture point match observations in one of the stored local maps, that map is recalled (and thus reused) from LTM. The fragmentation points induce a natural online clustering of the larger space, forming a set of intrinsic potential subgoals that are stored in LTM as a topological graph. Agents choose their next subgoal from the set of near and far potential subgoals from within the current local map or LTM, respectively. Thus, local maps guide exploration locally, while LTM promotes global exploration. We demonstrate that FARMap replicates the fragmentation points observed in animal studies. We evaluate FARMap on complex procedurally-generated spatial environments and realistic simulations to demonstrate that this mapping strategy covers the environment much more rapidly (in number of agent steps and wall-clock time) and is more efficient in active memory usage, without loss of performance.

URL: https://openreview.net/forum?id=cT8oOJ6Q6F

---

Title: UCB Exploration for Fixed-Budget Bayesian Best Arm Identification

Abstract: We study best-arm identification (BAI) in the fixed-budget setting. Adaptive allocations based on upper confidence bounds (UCBs), such as UCBE, are known to work well in BAI. However, it is well known that its optimal regret is theoretically instance-dependent, which we show to be an artifact in many fixed-budget BAI problems. In this paper, we propose a UCB exploration algorithm that is both theoretically and empirically efficient for the fixed-budget BAI problem under a Bayesian setting. The key idea is to learn prior information, which can enhance the performance of UCB-based BAI algorithms as it has in the cumulative regret minimization problem. We establish bounds on the failure probability and the simple regret for the Bayesian BAI problem, providing upper bounds of order $O(\sqrt{K/n})$, up to logarithmic factors, where $n$ represents the budget and $K$ denotes the number of arms. Furthermore, we demonstrate through empirical results that our approach consistently outperforms state-of-the-art baselines.
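
For context, a minimal sketch of the classical UCB-E style fixed-budget allocation that this line of work builds on (not the proposed Bayesian algorithm): with a fixed budget n, repeatedly pull the arm with the largest empirical mean plus exploration bonus, then recommend the best empirical arm.

    # Classical UCB-E style fixed-budget best-arm identification (a baseline sketch).
    import numpy as np

    rng = np.random.default_rng(0)
    true_means = np.array([0.3, 0.5, 0.45, 0.7, 0.65])
    K, n, a = len(true_means), 2000, 2.0              # a controls the exploration bonus

    counts = np.ones(K)                               # one initial pull per arm
    sums = rng.binomial(1, true_means).astype(float)
    for _ in range(n - K):
        ucb = sums / counts + np.sqrt(a / counts)     # UCB-E index
        arm = int(np.argmax(ucb))
        sums[arm] += rng.binomial(1, true_means[arm])
        counts[arm] += 1

    recommended = int(np.argmax(sums / counts))
    print("recommended arm:", recommended, "true best:", int(np.argmax(true_means)))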

URL: https://openreview.net/forum?id=BqSi73krYd

---

Title: Generalization Error Bounds for Learning under Censored Feedback

Abstract: Generalization error bounds from learning theory provide statistical guarantees on how well an algorithm will perform on previously unseen data. In this paper, we characterize the impacts of data non-IIDness due to censored feedback (a.k.a. selective labeling bias) on such bounds. We first derive an extension of the well-known Dvoretzky-Kiefer-Wolfowitz (DKW) inequality, which characterizes the gap between empirical and theoretical CDFs given IID data, to problems with non-IID data due to censored feedback. We then use this CDF error bound to provide a bound on the generalization error guarantees of a classifier trained on such non-IID data. We show that existing generalization error bounds (which do not account for censored feedback) fail to correctly capture the model's generalization guarantees, verifying the need for our bounds. We further analyze the effectiveness of (pure and bounded) exploration techniques, proposed by recent literature as a way to alleviate censored feedback, on improving our error bounds. Together, our findings illustrate how a decision maker should account for the trade-off between strengthening the generalization guarantees of an algorithm and the costs incurred in data collection when future data availability is limited by censored feedback.
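
For reference, a small worked example of the classical IID DKW inequality that the paper extends: with $n$ IID samples, $\sup_x |F_n(x) - F(x)| \le \sqrt{\log(2/\alpha)/(2n)}$ with probability at least $1-\alpha$.

    # Worked example of the classical (IID) DKW bound; the paper's contribution is its
    # extension to the non-IID, censored-feedback case.
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    n, alpha = 500, 0.05
    samples = rng.normal(size=n)
    eps = np.sqrt(np.log(2 / alpha) / (2 * n))         # DKW half-width, ~0.0607 here

    grid = np.linspace(-3, 3, 200)
    F_emp = (samples[:, None] <= grid).mean(axis=0)    # empirical CDF on the grid
    F_true = norm.cdf(grid)
    print("DKW band half-width:", round(eps, 4))
    print("max |F_n - F| observed:", round(np.abs(F_emp - F_true).max(), 4))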

URL: https://openreview.net/forum?id=rvoOttpqpY

---

Title: Learning by Self-Explaining

Abstract: Current AI research mainly treats explanations as a means for model inspection. Yet, this neglects findings from human psychology that describe the benefit of self-explanations in an agent’s learning process. Motivated by this, we introduce a novel approach in the context of image classification, termed Learning by Self-Explaining (LSX). LSX utilizes aspects of self-refining AI and human-guided explanatory machine learning. The underlying idea is that a learner model, in addition to optimizing for the original predictive task, is further optimized based on explanatory feedback from an internal critic model. Intuitively, a learner’s explanations are considered “useful” if the internal critic can perform the same task given these explanations. We provide an overview of important components of LSX and, based on this, perform extensive experimental evaluations via three different example instantiations. Our results indicate improvements via Learning by Self-Explaining on several levels: in terms of model generalization, reducing the influence of confounding factors, and providing more task-relevant and faithful model explanations. Overall, our work provides evidence for the potential of self-explaining within the learning phase of an AI model.

URL: https://openreview.net/forum?id=bpjU7rLjJ7

---

Title: Memorisation in Machine Learning: A Survey of Results

Abstract: Quantifying the impact of individual data samples on machine learning models is an open research problem. This is particularly relevant when complex and high-dimensional relationships have to be learned from a limited sample of the data generating distribution, such as in deep learning. It was previously shown that, in these cases, models rely not only on extracting patterns which are helpful for generalisation, but also seem to be required to incorporate some of the training data more or less as is, in a process often termed memorisation. This raises the question: if some memorisation is a requirement for effective learning, what are its privacy implications? In this work we unify a broad range of previous definitions and perspectives on memorisation in ML, discuss their interplay with model generalisation, and the implications of these phenomena for data privacy. Moreover, we systematise methods allowing practitioners to detect or quantify the occurrence of memorisation, and contextualise our findings in a broad range of ML settings. Finally, we discuss memorisation in the context of privacy attacks, differential privacy (DP) and adversarial actors.

URL: https://openreview.net/forum?id=HVWODwbrFK

---

Title: Diversity-Preserving $K$--Armed Bandits, Revisited

Abstract: We consider the bandit-based framework for diversity-preserving recommendations introduced by Celis et al. (2019), who approached it in the case of a polytope mainly by a reduction to the setting of linear bandits. We design a UCB algorithm using the specific structure of the setting and show that it enjoys a bounded distribution-dependent regret in the natural cases when the optimal mixed actions put some probability mass on all actions (i.e., when diversity is desirable). The regret lower bounds provided show that otherwise, at least when the model is mean-unbounded, a $\ln T$ regret is suffered. We also discuss an example beyond the special case of polytopes.

URL: https://openreview.net/forum?id=Viz7KBqO4A

---

Title: AutoCLIP: Auto-tuning Zero-Shot Classifiers for Vision-Language Models

Abstract: Classifiers built upon vision-language models such as CLIP have shown remarkable zero-shot performance across a broad range of image classification tasks. Prior work has studied different ways of automatically creating descriptor sets for every class based on prompt templates, ranging from manually engineered templates over templates obtained from a large language model to templates built from random words and characters. Up until now, deriving zero-shot classifiers from the respective encoded class descriptors has remained nearly unchanged, i.e., classify to the class that maximizes the cosine similarity between its averaged encoded class descriptors and the image encoding. However, weighing all class descriptors equally can be suboptimal when certain descriptors match visual clues on a given image better than others. In this work, we propose AutoCLIP, a method for auto-tuning zero-shot classifiers. AutoCLIP tunes per-image weights for each prompt template at inference time, based on statistics of class descriptor-image similarities. AutoCLIP is fully unsupervised, has only a minor additional computation overhead, and can be easily implemented in a few lines of code. We show that AutoCLIP outperforms baselines across a broad range of vision-language models, datasets, and prompt templates consistently and by up to 3 percentage points in accuracy.
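
A simplified illustration of per-image descriptor weighting (our own softmax rule over template-image similarities, not the exact AutoCLIP aggregation): instead of averaging all encoded class descriptors uniformly, weigh each prompt template per image before classifying.

    # Minimal sketch (a simplified stand-in for AutoCLIP's aggregation rule).
    import torch

    def weighted_zeroshot_logits(image_emb, descriptor_embs, tau=0.05):
        # image_emb: (D,); descriptor_embs: (n_classes, n_templates, D), all L2-normalised.
        sims = descriptor_embs @ image_emb                     # (n_classes, n_templates)
        template_scores = sims.mean(dim=0)                     # how well each template fits this image
        weights = torch.softmax(template_scores / tau, dim=0)  # per-image template weights
        return (sims * weights).sum(dim=1)                     # weighted class scores

    torch.manual_seed(0)
    img = torch.nn.functional.normalize(torch.randn(512), dim=0)
    desc = torch.nn.functional.normalize(torch.randn(10, 7, 512), dim=-1)
    print(weighted_zeroshot_logits(img, desc).argmax().item())  # predicted class index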

URL: https://openreview.net/forum?id=gVNyEVKjqf

---

Title: Nonasymptotic Laplace approximation under model misspecification

Abstract: In this note, we present non-asymptotic two-sided bounds to the log-marginal likelihood in Bayesian inference. The classical Laplace approximation is recovered as the leading term. Our derivation permits model misspecification and allows the parameter dimension to grow with the sample size. We do not make any assumptions about the asymptotic shape of the posterior, and instead require certain regularity conditions on the likelihood ratio and that the posterior is sufficiently concentrated. We envision the derived bounds to be widely applicable in establishing model selection consistency of Bayesian procedures in non-conjugate settings, especially when the true model potentially lies outside the class of candidate models considered.
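
For reference, the classical Laplace approximation recovered here as the leading term reads, with $\hat\theta$ the posterior mode, $d$ the parameter dimension, and $H_n(\hat\theta)$ the negative Hessian of the log-posterior at the mode:

    $\log p(y_{1:n}) \approx \log p(y_{1:n}\mid\hat\theta) + \log \pi(\hat\theta) + \frac{d}{2}\log(2\pi) - \frac{1}{2}\log\det H_n(\hat\theta)$,
    where $H_n(\hat\theta) = -\nabla_\theta^2\big[\log p(y_{1:n}\mid\theta) + \log\pi(\theta)\big]\big|_{\theta=\hat\theta}$.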

URL: https://openreview.net/forum?id=oxVKvRrL7q

---

Title: Defense Against Multi-target Backdoor Attacks

Abstract: Neural Trojan/backdoor attacks pose a significant threat to current deep-learning-based systems and are hard to defend against due to the lack of knowledge about triggers. In this paper, we first introduce a variant of BadNet that uses multiple triggers to control multiple target classes and allows these triggers to be at any location in the input image. These features make our attack more potent and easier to conduct in real-world scenarios. We empirically found that many well-known Trojan defenses fail to detect and mitigate our proposed attack. To defend against this attack, we then introduce an image-specific trigger reverse-engineering mechanism that uses multiple images to recover a variety of potential triggers. We then propose a detection mechanism by measuring the transferability of such recovered triggers. A Trojan trigger will have very high transferability, i.e., it makes other images also go to the same target class. We study many practical advantages of our attack and then apply our proposed defense mechanism to a variety of image datasets. The experimental results show the superiority of our method over the state of the art.

URL: https://openreview.net/forum?id=1NkpfJgNA4

---

Title: Differentially Private Non-convex Learning for Multi-layer Neural Networks

Abstract: This paper focuses on the problem of Differentially Private Stochastic Optimization for (multi-layer) fully connected neural networks with a single output node. In the first part, we examine cases with no hidden nodes, specifically focusing on Generalized Linear Models (GLMs). We investigate the well-specified model where the random noise has zero mean and the link function is both bounded and Lipschitz continuous. We propose several algorithms, and our analysis demonstrates the feasibility of achieving an excess population risk that remains invariant to the data dimension. We also delve into the scenario involving the ReLU link function, and our findings mirror those of the bounded link function. We conclude this section by contrasting well-specified and misspecified models, using ReLU regression as a representative example. In the second part of the paper, we extend our ideas to two-layer neural networks with sigmoid or ReLU activation functions in the well-specified model. In the third part, we study the theoretical guarantees of DP-SGD in Abadi et al. (2016) for fully connected multi-layer neural networks. By utilizing recent advances in Neural Tangent Kernel theory, we provide the first excess population risk bound when both the sample size and the width of the network are sufficiently large. Additionally, we discuss the role of certain parameters in DP-SGD regarding their utility, both theoretically and empirically.
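
For the third part, a minimal sketch of the standard DP-SGD step of Abadi et al. (2016) that is analysed there (shown for a tiny linear model; the hyperparameters are illustrative): clip each per-example gradient to norm C, add Gaussian noise scaled by sigma times C, and average.

    # Standard DP-SGD step (per-example clipping + Gaussian noise), sketched for illustration.
    import torch

    def dpsgd_step(model, loss_fn, xb, yb, lr=0.1, clip=1.0, sigma=1.0):
        grads = []
        for x, y in zip(xb, yb):                                  # per-example gradients
            model.zero_grad()
            loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
            g = torch.cat([p.grad.flatten() for p in model.parameters()])
            g = g / max(1.0, g.norm().item() / clip)              # clip to norm <= C
            grads.append(g)
        noisy = torch.stack(grads).sum(0) + sigma * clip * torch.randn(grads[0].shape)
        noisy /= len(xb)
        offset = 0
        with torch.no_grad():                                      # apply the averaged noisy gradient
            for p in model.parameters():
                p -= lr * noisy[offset : offset + p.numel()].reshape(p.shape)
                offset += p.numel()

    model = torch.nn.Linear(5, 1)
    x, y = torch.randn(16, 5), torch.randn(16, 1)
    dpsgd_step(model, torch.nn.functional.mse_loss, x, y)
    print("updated parameter norm:", sum(p.norm() for p in model.parameters()).item())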

URL: https://openreview.net/forum?id=8FZGEIDwNj

---

Title: Controllable Text Generation in the Instruction-Tuning Era

Abstract: While most research on controllable text generation has focused on steering base Language Models, the emerging instruction-tuning and prompting paradigm offers an alternate approach to controllability. We compile and release ConGenBench, a testbed of 17 different controllable generation tasks, using a subset of it to benchmark the performance of 9 different baselines and methods on Instruction-tuned Language Models. To our surprise, we find that prompting-based approaches outperform controllable text generation methods on most datasets and tasks, highlighting a need for research on controllable text generation specifically with Instruction-tuned Language Models. Prompt-based approaches match human performance on most stylistic tasks while lagging on structural tasks, foregrounding a need to study more varied constraints and more challenging stylistic tasks. To facilitate such research, we provide an algorithm that uses only a task dataset and a Large Language Model with in-context capabilities to automatically generate a constraint dataset. This method eliminates the field's dependence on pre-curated constraint datasets, hence vastly expanding the range of constraints that can be studied in the future.

URL: https://openreview.net/forum?id=iccY277A8G

---

Title: Revisiting Feature Prediction for Learning Visual Representations from Video

Abstract: This paper explores feature prediction as a stand-alone objective for unsupervised learning from video and introduces V-JEPA, a collection of vision models trained solely using a feature prediction objective, without the use of pretrained image encoders, text, negative examples, reconstruction, or other sources of supervision. The models are trained on 2 million videos collected from public datasets and are evaluated on downstream image and video tasks. Our results show that learning by predicting video features leads to versatile visual representations that perform well on both motion- and appearance-based tasks, without adaptation of the model's parameters, e.g., using a frozen backbone. Our largest model, a ViT-H/16 trained only on videos, obtains 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet1K.

URL: https://openreview.net/forum?id=QaCCuDfBk2

---

Title: A Self-Representation Learning Method for Unsupervised Feature Selection using Feature Space Basis

Abstract: Current methods of feature selection based on a self-representation framework use all the features of the original data in their representation framework. This issue carries over redundant and noisy features into the representation space, thereby diminishing the quality and effectiveness of the results. This work proposes a novel representation learning method, dubbed GRSSLFS (Graph Regularized Self-Representation and Sparse Subspace Learning), that mitigates the drawbacks of using all features. GRSSLFS employs an approach for constructing a basis for the feature space, which includes those features with the highest variance. The objective function of GRSSLFS is then developed based on a self-representation framework that combines subspace learning and matrix factorization of the basis matrix. Moreover, these basis features are incorporated into a manifold learning term to preserve the geometrical structure of the underlying data.
We provide an effectiveness and performance evaluation on several widely-used benchmark datasets. The results show that GRSSLFS achieves a high level of performance compared to several classic and state-of-the-art unsupervised feature selection methods.

URL: https://openreview.net/forum?id=LNvbgBFPMt

---

Title: Speech Separation based on pre-trained model and Deep Modularization

Abstract: Deep neural networks (DNNs) have been used extensively to achieve impressive results in speech separation. Most DNN approaches to speech separation rely on supervised learning, which is data-hungry, and their success depends on the availability of large-scale parallel clean-mixed speech pairs. Such data is often unavailable since it is difficult to create, which limits the applicability of supervised learning. Moreover, supervised learning in speech separation requires that systems deal with the permutation problem (permutation ambiguity). This places an upper limit on the quality of separated speech that a tool can attain. To avoid the problem of permutation ambiguity, speech separation based on clustering has been proposed by some recent works. However, these clustering techniques still rely on supervised learning and therefore still require high-quality paired data. To deal with permutation ambiguity and eliminate the need for a paired training dataset, we propose a fully unsupervised speech separation technique based on clustering of spectrogram points or raw speech blocks. Our technique couples traditional graph clustering objectives with deep neural networks to achieve speech separation. We start by extracting features of spectrogram points or raw speech blocks using a pre-trained model and then use these features in a downstream clustering task via deep modularization. Through this, we are able to identify clusters of spectrogram points or raw speech blocks dominated by each speaker in a speech mixture. We perform an extensive evaluation of the proposed technique and show that it outperforms the state-of-the-art tools included in the study.

URL: https://openreview.net/forum?id=RVa6Nd99ee

---

Title: Diffusion Models with Deterministic Normalizing Flow Priors

Abstract: For faster sampling and higher sample quality, we propose DiNof (Diffusion with Normalizing flow priors), a technique that makes use of normalizing flows and diffusion models. We use normalizing flows to parameterize the noisy data at any arbitrary step of the diffusion process and utilize it as the prior in the reverse diffusion process. More specifically, the forward noising process turns a data distribution into partially noisy data, which are subsequently transformed into a Gaussian distribution by a nonlinear process. The backward denoising procedure begins with a prior created by sampling from the Gaussian distribution and applying the invertible normalizing flow transformations deterministically. To generate the data distribution, the prior then undergoes the remaining diffusion stochastic denoising procedure. Through the reduction of the number of total diffusion steps, we are able to speed up both the forward and backward processes. More importantly, we improve the expressive power of diffusion models by employing both deterministic and stochastic mappings. Experiments on standard image generation datasets demonstrate the advantage of the proposed method over existing approaches. On the unconditional CIFAR10 dataset, for example, we achieve an FID of 2.01 and an Inception score of 9.96. Our method also demonstrates competitive performance on CelebA-HQ-256 dataset as it obtains an FID score of 7.11. Code is available at https://anonymous.4open.science/r/DiNof-F2D2.

URL: https://openreview.net/forum?id=ACMNVwcR6v

---

Title: Deep Generative Models through the Lens of the Manifold Hypothesis: A Survey and New Connections

Abstract: In recent years there has been increased interest in understanding the interplay between deep generative models (DGMs) and the manifold hypothesis. Research in this area focuses on understanding the reasons why commonly-used DGMs succeed or fail at learning distributions supported on unknown low-dimensional manifolds, as well as developing new models explicitly designed to account for manifold-supported data. This manifold lens provides both clarity as to why some DGMs (e.g. diffusion models and some generative adversarial networks) empirically surpass others (e.g. likelihood-based models such as variational autoencoders, normalizing flows, or energy-based models) at sample generation, and guidance for devising more performant DGMs. We carry out the first survey of DGMs viewed through this lens, making two novel contributions along the way. First, we formally establish that numerical instability of high-dimensional likelihoods is unavoidable when modelling low-dimensional data. We then show that DGMs on learned representations of autoencoders can be interpreted as approximately minimizing Wasserstein distance: this result, which applies to latent diffusion models, helps justify their outstanding empirical results. The manifold lens provides a rich perspective from which to understand DGMs, which we aim to make more accessible and widespread.

URL: https://openreview.net/forum?id=a90WpmSi0I

---

Title: Universal Functional Regression with Neural Operator Flows

Abstract: Regression on function spaces is typically limited to models with Gaussian process priors. We introduce the notion of universal functional regression, in which we aim to learn a prior distribution over non-Gaussian function spaces that remains mathematically tractable for functional regression. To do this, we develop Neural Operator Flows (OpFlow), an infinite-dimensional extension of normalizing flows. OpFlow is an invertible operator that maps the (potentially unknown) data function space into a Gaussian process, allowing for exact likelihood estimation of functional point evaluations. OpFlow enables robust and accurate uncertainty quantification via drawing posterior samples of the Gaussian process and subsequently mapping them into the data function space. We empirically study the performance of OpFlow on regression and generation tasks with data generated from Gaussian processes with known posterior forms and non-Gaussian processes, as well as real-world earthquake seismograms with an unknown closed-form distribution.

URL: https://openreview.net/forum?id=rHL329Xa3X

---

Title: Unsupervised Similarity Learning for Spectral Clustering

Abstract: Spectral clustering has been popularized due to its ability to identify non-convex boundaries between individual clusters. However, it requires defining a similarity metric to construct the Laplacian matrix. Instead of predefining this metric upfront, we propose to learn it by finding the optimal parameters of a kernel function. This learning approach parameterizes the data topology by optimizing a similarity function that assigns high similarity values to pairs of data points that share discriminative features, and vice versa. While some existing approaches also learn the similarity values, they rely on hyperparameters to do so. However, these hyperparameters cannot be validated in an unsupervised setting. As a result, suboptimal hyperparameter values can lead to detrimental performance. To circumvent this drawback, we propose a method that eliminates the need for hyperparameters by learning the optimal parameter of the similarity metric used in spectral clustering. This enables unsupervised learning of the similarity metric while performing spectral clustering. The method's capability is verified on several benchmark datasets with a high degree of non-convexity. Our method outperforms SOTA approaches in accuracy and normalized mutual information by up to 10% when applied to popular image and text datasets.

URL: https://openreview.net/forum?id=xtgDcssHBP

---

Title: Towards Trustworthy Reranking: A Simple yet Effective Abstention Mechanism

Abstract: Neural Information Retrieval (NIR) has significantly improved upon heuristic-based Information Retrieval (IR) systems. Yet, failures remain frequent, the models used often being unable to retrieve documents relevant to the user's query. We address this challenge by proposing a lightweight abstention mechanism tailored for real-world constraints, with particular emphasis placed on the reranking phase. We introduce a protocol for evaluating abstention strategies in black-box scenarios (typically encountered when relying on API services), demonstrating their efficacy, and propose a simple yet effective data-driven mechanism. We provide open-source code for experiment replication and abstention implementation, fostering wider adoption and application in diverse contexts.
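
A minimal sketch of a data-driven abstention rule of this flavour, with hypothetical names and a synthetic reference set (not the paper's exact mechanism): abstain whenever the reranker's top score falls below a threshold calibrated to a target precision on a small labelled reference set.

    # Minimal sketch of a calibrated score-threshold abstention rule for a black-box reranker.
    import numpy as np

    def calibrate_threshold(ref_top_scores, ref_is_relevant, target_precision=0.9):
        # Choose the smallest threshold whose accepted queries reach the target precision.
        order = np.argsort(ref_top_scores)
        scores, labels = ref_top_scores[order], ref_is_relevant[order]
        for i, thr in enumerate(scores):
            if labels[i:].mean() >= target_precision:
                return thr
        return scores[-1]                          # abstain on everything if never precise enough

    def answer_or_abstain(query_top_score, threshold):
        return "answer" if query_top_score >= threshold else "abstain"

    rng = np.random.default_rng(0)
    ref_scores = rng.normal(0.5, 0.2, size=200)                # top reranker scores (reference set)
    ref_labels = (ref_scores + rng.normal(0, 0.15, 200) > 0.5).astype(float)
    thr = calibrate_threshold(ref_scores, ref_labels)
    print(round(thr, 3), answer_or_abstain(0.8, thr), answer_or_abstain(0.2, thr))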

URL: https://openreview.net/forum?id=iMKUMWfRIj

---

Title: Convergence Analysis and Trajectory Comparison of Gradient Descent for Overparameterized Deep Linear Networks

Abstract: This paper presents a convergence analysis and trajectory comparison of the gradient descent (GD) method for overparameterized deep linear neural networks with different random initializations, demonstrating that the GD trajectory for these networks closely matches that of the corresponding convex optimization problem. This study touches upon a major open theoretical problem in machine learning: why are deep neural networks trained with GD methods efficient in many practical applications? While the solution of this problem is still beyond reach for general nonlinear deep neural networks, extensive efforts have been invested in studying relevant questions for deep linear neural networks, and many interesting results have been reported to date. For example, recent results on the loss landscape show that even though the loss function of deep linear neural networks is non-convex, every local minimizer is also a global minimizer. We focus on the trajectory of GD when applied to deep linear networks and demonstrate that, with appropriate initialization and sufficient width of the hidden layers, the GD trajectory closely matches that of the corresponding convex optimization problem. This result holds regardless of the depth of the network, providing insight into the efficiency of GD in the training of deep neural networks. Furthermore, we show that the GD trajectory for an overparameterized deep linear network automatically avoids bad saddle points.

URL: https://openreview.net/forum?id=jG7ndW7UHp

---

Title: Spawrious: A Benchmark for Fine Control of Spurious Correlation Biases

Abstract: The problem of spurious correlations (SCs) arises when a classifier relies on non-predictive features that happen to be correlated with the labels in the training data. For example, a classifier may misclassify dog breeds based on the background of dog images. This happens when the backgrounds are correlated with other breeds in the training data, leading to misclassifications during test time. Previous SC benchmark datasets suffer from varying issues, e.g., over-saturation or only containing one-to-one (O2O) SCs, but no many-to-many (M2M) SCs arising between groups of spurious attributes and classes. In this paper, we present Spawrious-{O2O, M2M}-{Easy, Medium, Hard}, an image classification benchmark suite containing spurious correlations between classes and backgrounds. To create this dataset, we employ a text-to-image model to generate photo-realistic images and an image captioning model to filter out unsuitable ones. The resulting dataset is of high quality and contains approximately 152k images. Our experimental results demonstrate that state-of-the-art group robustness methods struggle with Spawrious, most notably on the Hard-splits with none of them getting over $73\%$ accuracy on the hardest split using a ResNet50 pretrained on ImageNet. By examining model misclassifications, we detect reliances on spurious backgrounds, demonstrating that our dataset provides a significant challenge.

URL: https://openreview.net/forum?id=pBOe9UQ3iT

---

Title: Reproducibility Study of “Are Your Explanations Reliable?” Investigating the Stability of LIME in Explaining Text Classifiers by Marrying XAI and Adversarial Attack

Abstract: This work investigates the reproducibility of “Are Your Explanations Reliable?” Investigating the Stability of LIME in Explaining Text Classifiers by Marrying XAI and Adversarial Attack by Burger et al. (2023). Our objective is to replicate and verify this paper's findings. The code provided by the authors is utilised as a foundation; missing segments and substantial additions are implemented by us. Our work suggests that the inherent instability claim is only partially reproducible due to unspecified hyperparameters in the paper. Nonetheless, we successfully reproduced and extended the results regarding the choice of RBO as similarity measure. Lastly, the third claim was only partially reproducible due to constrained computational resources; however, we could verify it by observing similar trends on a small subset of the test data. In conclusion, all claims are supported to varying degrees through our reproducibility study.

URL: https://openreview.net/forum?id=kKS1ygcpUF

---

Title: Enhanced Federated Optimization: Adaptive Unbiased Sampling with Reduced Variance

Abstract: Federated Learning (FL) is a distributed learning paradigm to train a global model across multiple devices without collecting local data. In FL, a server typically selects a subset of clients for each training round to optimize resource usage. Central to this process is the technique of unbiased client sampling, which ensures a representative selection of clients. Current methods primarily utilize a random sampling procedure which, despite its effectiveness, achieves suboptimal efficiency owing to the loose upper bound caused by the sampling variance. In this work, by adopting an independent sampling procedure, we propose a federated optimization framework focused on adaptive unbiased client sampling, improving the convergence rate via an online variance reduction strategy.
In particular, we present the first adaptive client sampler, K-Vib, employing an independent sampling procedure. K-Vib achieves a linear speed-up on the regret bound $\tilde{\mathcal{O}}\big(N^{\frac{1}{3}}T^{\frac{2}{3}}/K^{\frac{4}{3}}\big)$ within a set communication budget $K$. Empirical studies indicate that K-Vib doubles the speed compared to baseline algorithms, demonstrating significant potential in federated optimization.

URL: https://openreview.net/forum?id=CKQ3sMt4tx

---

Title: Sequential Best-Arm Identification with Application to P300 Speller

Abstract: A brain-computer interface (BCI) is an advanced technology that facilitates direct communication between the human brain and a computer system, by enabling individuals to interact with devices using only their thoughts. The P300 speller is a primary type of BCI system, which allows users to spell words without using a physical keyboard, but instead by capturing and interpreting brain electroencephalogram (EEG) signals under different stimulus presentation paradigms. Traditional non-adaptive presentation paradigms, however, treat each word selection as an isolated event, resulting in a lengthy learning process. To enhance efficiency, we cast the problem as a sequence of best-arm identification tasks within the context of multi-armed bandits, where each task corresponds to the interaction between the user and the system for a single character or word. Leveraging large language models, we utilize the prior knowledge learned from previous tasks to inform and facilitate subsequent tasks. We propose a sequential top-two Thompson sampling algorithm under two scenarios: the fixed-confidence setting and the fixed-budget setting. We study the theoretical property of the proposed algorithm, and demonstrate its substantial empirical improvement through both simulations as well as the data generated from a P300 speller simulator that was built upon the real BCI experiments.
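
A minimal sketch of a generic top-two Thompson sampling loop with Beta priors (not the paper's full P300 algorithm with language-model priors): sample a leader from the posterior; with probability 1 - beta, resample until a different challenger arm leads, and pull that instead.

    # Generic top-two Thompson sampling with Bernoulli arms and Beta(1, 1) priors.
    import numpy as np

    rng = np.random.default_rng(0)
    true_p = np.array([0.25, 0.4, 0.55, 0.35])       # e.g. per-stimulus response probabilities
    K, beta = len(true_p), 0.5
    alpha_post, beta_post = np.ones(K), np.ones(K)   # Beta(1, 1) priors

    for t in range(1000):
        leader = int(np.argmax(rng.beta(alpha_post, beta_post)))
        arm = leader
        if rng.random() > beta:                      # pick a challenger instead
            while arm == leader:
                arm = int(np.argmax(rng.beta(alpha_post, beta_post)))
        reward = rng.binomial(1, true_p[arm])
        alpha_post[arm] += reward
        beta_post[arm] += 1 - reward

    print("identified best arm:", int(np.argmax(alpha_post / (alpha_post + beta_post))))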

URL: https://openreview.net/forum?id=QweNIIqvZf

---

Title: Self-Supervised Visual Representation Learning for Medical Image Analysis: A Comprehensive Survey

Abstract: Deep learning has become a powerful tool for achieving strong performance on many computer vision and natural language processing tasks. However, supervised deep learning algorithms require a large amount of labeled data to perform well. Self-supervised learning, a subcategory of unsupervised learning, circumvents this requirement by learning representations from the data without labeled examples. Over the past few years, Self-Supervised Learning has been applied to a variety of tasks, achieving performance on par with, and in several cases surpassing, its supervised counterparts. However, progress has been so rapid that a proper account of this body of work is still missing. In this study, we attempt to present a review of these methods and show how the Self-Supervised Learning paradigm has evolved over the years. Along with this objective, we also present an exhaustive review of the Self-Supervised methods applied to Medical Image Analysis. Furthermore, we present an extensive compilation of the details of the datasets used in the different works, along with the performance metrics of some notable works on image and video datasets.

URL: https://openreview.net/forum?id=3Wg1oErMcJ

---

Title: Dreamix: Video Diffusion Models are General Video Editors

Abstract: Text-driven image and video diffusion models have recently achieved unprecedented generation realism. While diffusion models have been successfully applied to image editing, few can edit motion in video. We present a diffusion-based method that can perform text-based motion and appearance editing of general, real-world videos. Our approach uses a video diffusion model to combine, at inference time, the low-resolution spatio-temporal information from the original video with new, high-resolution information that it synthesizes to align with the guiding text prompt. As maintaining high fidelity to the original video requires retaining some of its high-resolution information, we add a preliminary stage of finetuning the model on the original video, significantly boosting fidelity. We propose to improve motion editability by using a mixed objective that jointly finetunes with full temporal attention and with temporal attention masking. We extend our method to animating images, bringing them to life by adding motion to existing or new objects, as well as camera movements. Extensive experiments showcase our method's remarkable ability to edit motion in videos.

URL: https://openreview.net/forum?id=v8i3Meu2Zu

---

Title: CroissantLLM: A Truly Bilingual French-English Language Model

Abstract: We introduce CroissantLLM, a 1.3B language model pretrained on a set of 3T English and French tokens, to bring to the research and industrial community a high-performance, fully open-sourced bilingual model that runs swiftly on consumer-grade local hardware. To that end, we pioneer the approach of training an intrinsically bilingual model with a 1:1 English-to-French pretraining data ratio, a custom tokenizer, and bilingual finetuning datasets. We release the training dataset, notably containing a French split with manually curated, high-quality, and varied data sources. To assess performance outside of English, we craft a novel benchmark, FrenchBench, consisting of an array of classification and generation tasks, covering various orthogonal aspects of model performance in the French Language. Additionally, rooted in transparency and to foster further Large Language Model research, we release codebases, and dozens of checkpoints across various model sizes, training data distributions, and training steps, as well as fine-tuned Chat models, and strong translation models. We evaluate our model through the FMTI framework, and validate 81 % of the transparency criteria, far beyond the scores of even most open initiatives. This work enriches the NLP landscape, breaking away from previous English-centric work in order to strengthen our understanding of multilinguality in language models.

URL: https://openreview.net/forum?id=uA19Xo1o31

---

Title: Generative Models are Self-Watermarked: Declaring Model Authentication through Re-Generation

Abstract: As machine- and AI-generated content proliferates, protecting the intellectual property of generative models has become imperative, yet verifying data ownership poses formidable challenges, particularly in cases of unauthorized reuse of generated data. Confirming ownership of the data is challenging, as the data generation process is opaque to those verifying its authenticity. Our work is dedicated to detecting data reuse from a single sample. While watermarking has been the traditional method to detect AI-generated content, by embedding specific information within models or their outputs at the risk of compromising output quality, our approach instead identifies inherent fingerprints in the outputs without altering the models. Verification is achieved by requiring the (authentic) models to re-generate the data. Furthermore, we propose a method that iteratively re-generates the data to enhance these fingerprints in the generation stage. The strategy is both theoretically sound and empirically proven effective with recent advanced text and image generative models. Our approach is significant because it avoids extra operations or measures, such as (1) modifying model parameters, (2) altering the generated outputs, or (3) employing additional classification models for verification. This enhancement broadens the applicability of authorship verification (1) to tracking IP violations in generative models published without explicitly designed watermark mechanisms and (2) to producing outputs without compromising their quality.

URL: https://openreview.net/forum?id=LUHmWDydue

---

Title: Revisiting Energy Based Models as Policies: Ranking Noise Contrastive Estimation and Interpolating Energy Models

Abstract: A crucial design decision for any robot learning pipeline is the choice of policy representation: what type of model should be used to generate the next set of robot actions? Owing to the inherent multi-modal nature of many robotic tasks, combined with the recent successes in generative modeling, researchers have turned to state-of-the-art probabilistic models such as diffusion models for policy representation. In this work, we revisit the choice of energy-based models (EBM) as a policy class.

We show that the prevailing folklore---that energy models in high-dimensional continuous spaces are impractical to train---is false. We develop a practical training objective and algorithm for energy models which combines several key ingredients: (i) ranking noise contrastive estimation (R-NCE), (ii) learnable negative samplers, and (iii) non-adversarial joint training. We prove that our proposed objective function is asymptotically consistent and quantify its limiting variance. On the other hand, we show that the Implicit Behavior Cloning (IBC) objective is actually biased even at the population level, providing a mathematical explanation for the poor performance of IBC-trained energy policies in several independent follow-up works. We further extend our algorithm to learn a continuous stochastic process that bridges noise and data, modeling this process with a family of EBMs indexed by a scale variable. In doing so, we demonstrate that the core idea behind recent progress in generative modeling is actually compatible with EBMs. Altogether, our proposed training algorithms enable us to train energy-based models as policies which compete with---and even outperform---diffusion models and other state-of-the-art approaches in several challenging multi-modal benchmarks: obstacle avoidance path planning and contact-rich block pushing.
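
For intuition, a ranking-NCE-style loss classifies which of K+1 candidates is the true data point, with logits corrected by the negative sampler's log-density. The sketch below is a generic form under that assumption, not the paper's exact objective; tensor shapes are noted in comments.

    import torch
    import torch.nn.functional as F

    def ranking_nce_loss(e_data, e_neg, log_q_data, log_q_neg):
        """e_data: (B,) energies of data samples; e_neg: (B, K) energies of
        negatives drawn from a learnable sampler q; log_q_*: the sampler's
        log-densities, used to correct the logits."""
        logits = torch.cat([(-e_data - log_q_data).unsqueeze(1),
                            -e_neg - log_q_neg], dim=1)        # (B, K+1)
        target = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
        return F.cross_entropy(logits, target)  # the data sample sits at index 0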

URL: https://openreview.net/forum?id=JmKAYb7I00

---

Title: Locally Optimal Fixed-Budget Best Arm Identification in Two-Armed Gaussian Bandits with Unknown Variances

Abstract: We address the problem of best arm identification (BAI) with a fixed budget for two-armed Gaussian bandits. In BAI, given multiple arms, we aim to find the best arm, the one with the highest expected reward, through an adaptive experiment. Kaufmann et al. (2016) develop a lower bound for the probability of misidentifying the best arm. They also propose a strategy, assuming that the variances of rewards are known, and show that it is asymptotically optimal in the sense that its probability of misidentification matches the lower bound as the budget approaches infinity. However, an asymptotically optimal strategy is unknown when the variances are unknown. For this open issue, we propose a strategy that estimates the variances during the adaptive experiment and draws arms with a ratio of the estimated standard deviations. We refer to this strategy as the \emph{Neyman Allocation (NA)-Augmented Inverse Probability weighting (AIPW)} strategy. We then demonstrate that this strategy is asymptotically optimal by showing that its probability of misidentification matches the lower bound when the budget approaches infinity and the gap between the expected rewards of the two arms approaches zero (\emph{small-gap regime}).
Our results suggest that under the worst-case scenario characterized by the small-gap regime, our strategy, which employs estimated variance, is asymptotically optimal even when the variances are unknown.
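
As a concrete illustration of the allocation rule described above (drawing arms in proportion to estimated standard deviations), here is a minimal sketch; the AIPW-based recommendation rule and any forced-exploration details are omitted.

    import numpy as np

    def draw_next_arm(rewards_0, rewards_1, rng):
        """Neyman-style allocation with plug-in standard deviation estimates."""
        s0 = np.std(rewards_0, ddof=1) if len(rewards_0) > 1 else 1.0
        s1 = np.std(rewards_1, ddof=1) if len(rewards_1) > 1 else 1.0
        return 0 if rng.random() < s0 / (s0 + s1) else 1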

URL: https://openreview.net/forum?id=2Mdi4AAE1E

---

Title: ConsistI2V: Enhancing Visual Consistency for Image-to-Video Generation

Abstract: Image-to-video (I2V) generation aims to use the initial frame (alongside a text prompt) to create a video sequence. A grand challenge in I2V generation is to maintain visual consistency throughout the video: existing methods often struggle to preserve the integrity of the subject, background, and style from the first frame, as well as ensure a fluid and logical progression within the video narrative. To mitigate these issues, we propose ConsistI2V, a diffusion-based method to enhance visual consistency for I2V generation. Specifically, we introduce (1) spatiotemporal attention over the first frame to maintain spatial and motion consistency, (2) noise initialization from the low-frequency band of the first frame to enhance layout consistency. These two approaches enable ConsistI2V to generate highly consistent videos. We also extend the proposed approaches to show their potential to improve consistency in auto-regressive long video generation and camera motion control. To verify the effectiveness of our method, we propose I2V-Bench, a comprehensive evaluation benchmark for I2V generation. Our automatic and human evaluation results demonstrate the superiority of ConsistI2V over existing methods.

URL: https://openreview.net/forum?id=vqniLmUDvj

---

Title: Variance Reduced Smoothed Functional REINFORCE Policy Gradient Algorithms

Abstract: We revisit the REINFORCE policy gradient algorithm from the literature. This algorithm typically works with reward (or cost) returns obtained over episodes or trajectories. We propose a major enhancement to the basic algorithm in which we estimate the policy gradient using a smoothed functional (random perturbation) gradient estimator requiring one function measurement over a perturbed parameter. Subsequently, we also propose a two-simulation counterpart of the algorithm that has lower estimator bias. Like REINFORCE, our algorithms are trajectory-based Monte-Carlo schemes and usually suffer from high variance. To handle this issue, we propose two independent enhancements to the basic scheme: (i) use the sign of the increment instead of the original (full) increment, which results in smoother albeit possibly slower convergence, and (ii) use clipped costs or rewards as proposed in the Proximal Policy Optimization (PPO)-based scheme. We analyze the asymptotic convergence of the algorithm in the one-simulation case as well as the case where signed updates are used, and briefly discuss the changes in the analysis when two-simulation estimators are used. Finally, we report the results of several experiments on various Grid-World settings in which we compare our algorithms with REINFORCE as well as PPO, and observe that both our one- and two-simulation SF algorithms outperform these baselines. Further, the versions of these algorithms with clipped gradients and signed updates show good performance with lower variance.
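
To make the estimator concrete, the one- and two-measurement smoothed functional (SF) gradient estimates take roughly the following form, where J is an episode-return estimate for the perturbed policy parameters; this is a generic sketch, not the paper's exact algorithm.

    import numpy as np

    def sf_gradient(J, theta, delta, rng, two_sided=False):
        """Smoothed-functional (random perturbation) estimate of grad J(theta)."""
        d = rng.standard_normal(theta.shape)        # Gaussian perturbation direction
        if two_sided:                                # lower-bias, two measurements
            return d * (J(theta + delta * d) - J(theta - delta * d)) / (2 * delta)
        return d * J(theta + delta * d) / delta      # single measurement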

URL: https://openreview.net/forum?id=loaWwnhYaS

---

Title: Self-Improvement for Neural Combinatorial Optimization: Sample Without Replacement, but Improvement

Abstract: Current methods for end-to-end constructive neural combinatorial optimization usually train a policy using behavior cloning from expert solutions or policy gradient methods from reinforcement learning. While behavior cloning is straightforward, it requires expensive expert solutions, and policy gradient methods are often computationally demanding and complex to fine-tune. In this work, we bridge the two and simplify the training process by sampling multiple solutions for random instances using the current model in each epoch and then selecting the best solution as an expert trajectory for supervised imitation learning. To achieve progressively improving solutions with minimal sampling, we introduce a method that combines round-wise Stochastic Beam Search with an update strategy derived from a provable policy improvement. This strategy refines the policy between rounds by utilizing the advantage of the sampled sequences with almost no computational overhead. We evaluate our approach on the Traveling Salesman Problem and the Capacitated Vehicle Routing Problem. The models trained with our method achieve comparable performance and generalization to those trained with expert data. Additionally, we apply our method to the Job Shop Scheduling Problem using a transformer-based architecture and outperform existing state-of-the-art methods by a wide margin.
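
The training loop described above can be summarised as the following sketch, in which sample_solutions, cost, and imitate are assumed placeholders; the round-wise Stochastic Beam Search and the provable policy-improvement update are abstracted away.

    def self_improvement_epoch(model, instances, sample_solutions, cost, imitate):
        """Sample solutions with the current model, keep the best per instance,
        and use it as a pseudo-expert trajectory for supervised imitation."""
        pseudo_expert = []
        for instance in instances:
            candidates = sample_solutions(model, instance)   # multiple sampled tours
            best = min(candidates, key=lambda sol: cost(instance, sol))
            pseudo_expert.append((instance, best))
        imitate(model, pseudo_expert)                        # one imitation-learning step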

URL: https://openreview.net/forum?id=agT8ojoH0X

---

Title: Semi-Supervised Semantic Segmentation via Marginal Contextual Information

Abstract: We present a novel confidence refinement scheme that enhances pseudo-labels in semi-supervised semantic segmentation. Unlike existing methods, which filter pixels with low-confidence predictions in isolation, our approach leverages the spatial correlation of labels in segmentation maps by grouping neighboring pixels and considering their pseudo-labels collectively. With this contextual information, our method, named S4MC, increases the amount of unlabeled data used during training while maintaining the quality of the pseudo-labels, all with negligible computational overhead. Through extensive experiments on standard benchmarks, we demonstrate that S4MC outperforms existing state-of-the-art semi-supervised learning approaches, offering a promising solution for reducing the cost of acquiring dense annotations. For example, S4MC achieves a 1.39 mIoU improvement over the prior art on PASCAL VOC 12 with 366 annotated images. The code to reproduce our experiments is available at https://s4mcontext.github.io/
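
One simple way to exploit contextual information of the kind described above is to mix each pixel's class probabilities with the maximum probability of the same class in a small neighbourhood before thresholding pseudo-labels. The sketch below is illustrative only and is not claimed to be S4MC's exact refinement rule.

    import torch.nn.functional as F

    def refine_confidence(probs, kernel_size=3, alpha=0.5):
        """probs: (B, C, H, W) softmax outputs.  Returns neighbour-aware scores
        used for pseudo-label thresholding."""
        neighbor_max = F.max_pool2d(probs, kernel_size, stride=1,
                                    padding=kernel_size // 2)
        return (1 - alpha) * probs + alpha * neighbor_max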

URL: https://openreview.net/forum?id=i5yKW1pmjW

---

Title: Navigating the Maze of Explainable AI: A Systematic Approach to Evaluating Methods and Metrics

Abstract: Explainable AI (XAI) is a rapidly growing domain with a myriad of proposed methods as well as metrics aiming to evaluate their efficacy. However, current literature is often of limited scope, examining only a handful of XAI methods and ignoring underlying design parameters for performance, such as the model architecture or the nature of input data. Moreover, they often rely on one or a few metrics, neglecting thorough validation and increasing the risk of selection bias. These shortcomings leave practitioners confused about which method to choose for their problem. In response, we introduce LATEC, a large-scale benchmark that critically evaluates 17 prominent XAI methods using 20 distinct metrics. We systematically incorporate vital design parameters like varied architectures and diverse input modalities, resulting in 7,560 examined combinations. Through LATEC, we first showcase the high risk of conflicting metrics leading to unreliable rankings, and propose a robust evaluation scheme. Further, we comprehensively evaluate various XAI methods to assist practitioners in selecting appropriate methods aligning with their needs. Curiously, the emerging top-performing method, Expected Gradients, has not been examined in relevant related studies before. LATEC reinforces its role in future XAI research by publicly releasing all auxiliary data, including model weights, over 326k saliency maps, and 378k metric scores as a dataset. The benchmark is hosted at: https://github.com/kjdhfg/LATEC.

URL: https://openreview.net/forum?id=MQ64tVAcl6

---

Title: GPEN: Global Positional Encoding Network for Graphs

Abstract: Non-grid-structured data, e.g., citation networks, social networks, and web page networks, is often represented as graphs. However, such data cannot be fed into Convolutional Neural Networks (CNNs) like images because of the variable number of unordered nodes and the uncertain number of neighbours for each node. Thus, Graph Neural Networks (GNNs) have been designed. They use a message-passing scheme to aggregate each node's and its neighbours' feature representations, regardless of the number of nodes and their order. Introducing feature-independent encoding methods to GNNs is crucial to preserving graphs' structural information and making node representations more discriminative. However, local-distance-aware methods, e.g., DE-GNN, only contain the information within subgraphs, resulting in ambiguity when comparing two subgraphs with the same structure. In this paper, our Global Positional Encoding Network (GPEN) is proposed to embed each node's global positional information by calculating its distances to a set of randomly sampled referential nodes. We employ a contrastive loss on pairwise distances of different nodes to make positional representations more discriminative while retaining the relative interactions between nodes. We evaluate GPEN on node classification datasets by inserting the encoding scheme into a backbone GNN and demonstrate that it outperforms state-of-the-art encoding methods on homophilic graphs by up to 33.12% in accuracy.
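
The core encoding idea (distances to randomly sampled referential nodes) can be sketched as follows; the contrastive-loss training of these features is not shown, and the sentinel value for unreachable nodes is an assumption.

    import networkx as nx
    import numpy as np

    def anchor_distance_encoding(graph, num_anchors=8, seed=0):
        """Encode each node by shortest-path distances to random anchor nodes."""
        rng = np.random.default_rng(seed)
        nodes = list(graph.nodes())
        anchors = rng.choice(nodes, size=min(num_anchors, len(nodes)), replace=False)
        enc = np.full((len(nodes), len(anchors)), float(len(nodes)))  # sentinel distance
        for j, a in enumerate(anchors):
            dist = nx.single_source_shortest_path_length(graph, a)
            for i, v in enumerate(nodes):
                if v in dist:
                    enc[i, j] = dist[v]
        return enc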

URL: https://openreview.net/forum?id=XSfU9bmZnN

---

Title: C3DM: Constrained-Context Conditional Diffusion Models for Imitation Learning

Abstract: Behavior Cloning (BC) methods are effective at learning complex manipulation tasks. However, they are prone to spurious correlation - expressive models may focus on distractors that are irrelevant to action prediction - and are thus fragile in real-world deployment. Prior methods have addressed this challenge by exploring different model architectures and action representations. However, none were able to balance between sample efficiency and robustness against distractors for solving manipulation tasks with a complex action space. We present Constrained-Context Conditional Diffusion Model (C3DM), a diffusion model policy for solving 6-DoF robotic manipulation tasks with robustness to distractions that can learn deployable robot policies from as little as five demonstrations. A key component of C3DM is a \emph{fixation} step that helps the action denoiser to focus on task-relevant regions around a predicted \emph{fixation point} while ignoring distractors in the context. We empirically show that C3DM is robust to out-of-distribution distractors, and consistently achieves high success rates on a wide array of tasks, ranging from table-top manipulation to industrial kitting that require varying levels of precision and robustness to distractors.

URL: https://openreview.net/forum?id=jcleXdnRA1

---

Title: KNN-CLIP: Retrieval Enables Training-Free Segmentation on Continually Expanding Large Vocabularies

Abstract: Rapid advancements in continual segmentation have yet to bridge the gap of scaling to large, continually expanding vocabularies under compute-constrained scenarios. We discover that traditional continual training leads to catastrophic forgetting under compute constraints, leaving it unable to outperform zero-shot segmentation methods. We introduce a novel strategy for semantic and panoptic segmentation with zero forgetting, capable of adapting to continually growing vocabularies without the need for retraining or large memory costs. Our training-free approach, KNN-CLIP, leverages a database of instance embeddings to enable open-vocabulary segmentation approaches to continually expand their vocabulary on any given domain with a single pass through the data, while storing only embeddings, minimizing both compute and memory costs. This method achieves state-of-the-art mIoU performance across large-vocabulary semantic and panoptic segmentation datasets. We hope KNN-CLIP represents a step forward in enabling more efficient and adaptable continual segmentation, paving the way for advances in real-world large-vocabulary continual segmentation methods.
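
The retrieval step at the heart of such a training-free approach can be sketched as a k-nearest-neighbour vote over stored instance embeddings; the fusion with the underlying open-vocabulary segmenter is omitted, and the helper is illustrative.

    import numpy as np

    def knn_label(query_emb, db_embs, db_labels, k=5):
        """Majority vote over the k most cosine-similar database embeddings."""
        q = query_emb / np.linalg.norm(query_emb)
        db = db_embs / np.linalg.norm(db_embs, axis=1, keepdims=True)
        top = np.argsort(-(db @ q))[:k]
        values, counts = np.unique(np.asarray(db_labels)[top], return_counts=True)
        return values[np.argmax(counts)]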

URL: https://openreview.net/forum?id=ZSqP1RT8jC

---

Title: In-context Learning with Retrieved Demonstrations for Language Models: A Survey

Abstract: Language models, especially pre-trained large language models, have showcased remarkable abilities as few-shot in-context learners (ICL), adept at adapting to new tasks with just a few demonstrations in the input context.
However, the model's ability to perform ICL is sensitive to the choice of the few-shot demonstrations.
Instead of using a fixed set of demonstrations, one recent development is to \emph{retrieve} demonstrations tailored to each input query.
The implementation of demonstration retrieval is relatively straightforward, leveraging existing databases and retrieval systems. This not only improves the efficiency and scalability of the learning process but also has been shown to reduce biases inherent in manual example selection. In light of the encouraging results and growing research in ICL with retrieved demonstrations, we conduct an extensive review of studies in this area. In this survey, we discuss and compare different design choices for retrieval models, retrieval training procedures, and inference algorithms.

URL: https://openreview.net/forum?id=NQPo8ZhQPa

---

Title: What Images are More Memorable to Machines?

Abstract: We study the problem of measuring and predicting how memorable an image is to pattern recognition machines, as a path to explore machine intelligence. Firstly, we propose a self-supervised machine memory quantification pipeline, dubbed `MachineMem measurer', to collect machine memorability scores of images. Similar to humans, machines also tend to memorize certain kinds of images, whereas the types of images that machines and humans memorize are different. Through in-depth analysis and comprehensive visualizations, we gradually unveil that `complex' images are usually more memorable to machines. We further conduct extensive experiments across 11 different machines and 9 pre-training methods to analyze and understand machine memory. This work proposes the concept of machine memorability and opens a new research direction at the interface between machine memory and visual data.

URL: https://openreview.net/forum?id=e8zI9o7M9G

---

Title: Unified Convergence Theory of Stochastic and Variance-Reduced Cubic Newton Methods

Abstract: We study stochastic Cubic Newton methods for solving general, possibly non-convex minimization problems. We propose a new framework, the helper framework, that provides a unified view of the stochastic and variance-reduced second-order algorithms equipped with global complexity guarantees; it can also be applied to learning with auxiliary information. Our helper framework offers the algorithm designer high flexibility for constructing and analyzing stochastic Cubic Newton methods, allowing arbitrary batch sizes and the use of noisy and possibly biased estimates of the gradients and Hessians, and incorporating both variance reduction and lazy Hessian updates. We recover the best-known complexities for the stochastic and variance-reduced Cubic Newton methods under weak assumptions on the noise. A direct consequence of our theory is a new lazy stochastic second-order method, which significantly improves the arithmetic complexity for high-dimensional problems. We also establish complexity bounds for the classes of gradient-dominated objectives that include convex and strongly convex problems. For Auxiliary Learning, we show that using a helper (auxiliary function) can outperform training alone if a given similarity measure is small.

URL: https://openreview.net/forum?id=FCs5czlDTr

---

Title: PAITS: Pretraining and Augmentation for Irregularly-Sampled Time Series

Abstract: Real-world time series data that commonly reflect sequential human behavior are often uniquely irregularly sampled and sparse, with highly nonuniform sampling over time and entities. Yet, commonly-used pretraining and augmentation methods for time series are not specifically designed for such scenarios. In this paper, we present PAITS (Pretraining and Augmentation for Irregularly-sampled Time Series), a framework for identifying suitable pretraining strategies for sparse and irregularly sampled time series datasets. PAITS leverages a novel combination of NLP-inspired pretraining tasks and augmentations, and a random search to identify an effective strategy for a given dataset. We demonstrate that different datasets benefit from different pretraining choices. Compared with prior methods, our approach is better able to consistently improve pretraining across multiple datasets and domains. Our code is attached and will be publicly available.

URL: https://openreview.net/forum?id=snjM9YZXSR

---

Title: Weighted Risk Invariance: Domain Generalization under Invariant Feature Shift

Abstract: Learning models whose predictions are invariant under multiple environments is a promising approach for out-of-distribution generalization. Such models are trained to extract features $X_{\text{inv}}$ where the conditional distribution $Y \mid X_{\text{inv}}$ of the label given the extracted features does not change across environments. Invariant models are also supposed to generalize to shifts in the marginal distribution $p(X_{\text{inv}})$ of the extracted features $X_{\text{inv}}$, a type of shift we call an invariant covariate shift. However, we show that proposed methods for learning invariant models underperform under invariant covariate shift, either failing to learn invariant models---even for data generated from simple and well-studied linear-Gaussian models---or having poor finite-sample performance. To alleviate these problems, we propose weighted risk invariance (WRI). Our framework is based on imposing invariance of the loss across environments subject to appropriate reweightings of the training examples. We show that WRI provably learns invariant models, i.e. discards spurious correlations, in linear-Gaussian settings. We propose a practical algorithm to implement WRI by learning the density $p(X_{\text{inv}})$ and the model parameters simultaneously, and we demonstrate empirically that WRI outperforms previous invariant learning methods under invariant covariate shift.

URL: https://openreview.net/forum?id=WyPKLWPYsr

---

Title: SeqLink: A Robust Neural-ODE Architecture for Modelling Partially Observed Time Series

Abstract: Ordinary Differential Equations (ODE) based models have become popular as foundation models for solving many time series problems. Combining neural ODEs with traditional RNN models has provided the best representation for irregular time series. However, ODE-based models typically require the trajectory of hidden states to be defined based on either the initial observed value or the most recent observation, raising questions about their effectiveness when dealing with longer sequences and extended time intervals. In this article, we explore the behaviour of the ODE models in the context of time series data with varying degrees of sparsity. We introduce SeqLink, an innovative neural architecture designed to enhance the robustness of sequence representation. Unlike traditional approaches that solely rely on the hidden state generated from the last observed value, SeqLink leverages ODE latent representations derived from multiple data samples, enabling it to generate robust data representations regardless of sequence length or data sparsity level. The core concept behind our model is the definition of hidden states for the unobserved values based on the relationships between samples (links between sequences). Through extensive experiments on partially observed synthetic and real-world datasets, we demonstrate that SeqLink improves the modelling of intermittent time series, consistently outperforming state-of-the-art approaches.

URL: https://openreview.net/forum?id=WCUT6leXKf

---

Title: Bayesian Quantification with Black-Box Estimators

Abstract: Understanding how different classes are distributed in an unlabeled data set is important for the calibration of probabilistic classifiers and for uncertainty quantification. Methods like adjusted classify and count, black-box shift estimators, and invariant ratio estimators use an auxiliary and potentially biased black-box classifier trained on a different data set to estimate the class distribution on the current data set, and yield asymptotic guarantees under weak assumptions. We demonstrate that these algorithms are closely related to inference in a particular probabilistic graphical model approximating the assumed ground-truth generative process, and we propose a Bayesian estimator. Then, we discuss an efficient Markov chain Monte Carlo sampling scheme for the introduced model and show an asymptotic consistency guarantee in the large-data limit. We compare the introduced model against the established point estimators in a variety of scenarios and show that it is competitive with, and in some cases superior to, the non-Bayesian alternatives.
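
For context, the classical black-box shift estimator referenced above recovers the target class prior by solving a small linear system built from a held-out confusion matrix; the sketch below shows that point estimator, which the paper's Bayesian treatment replaces with posterior inference.

    import numpy as np

    def bbse_class_prior(conf_matrix, target_pred_dist):
        """conf_matrix[i, j] = P(predict i | true class j), estimated on held-out
        source data; target_pred_dist[i] = fraction of target points predicted i.
        Solves conf_matrix @ pi = target_pred_dist for the target prior pi."""
        pi, *_ = np.linalg.lstsq(conf_matrix, target_pred_dist, rcond=None)
        pi = np.clip(pi, 0.0, None)
        return pi / pi.sum()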

URL: https://openreview.net/forum?id=Ft4kHrOawZ

---

Title: Learning State Reachability as a Graph in Translation Invariant Goal-based Reinforcement Learning Tasks

Abstract: Deep Reinforcement Learning proved efficient at learning universal control policies when the goal state is close enough to the starting state, or when the value function features few discontinuities.
But reaching goals that require long action sequences in complex environments remains difficult.
Drawing inspiration from the cognitive process which reuses learned atomic skills in a global planning procedure, we propose an algorithm which encodes reachability between abstract goals as a graph, and produces plans in this goal space.
Transitions between goals rely on the exploitation of a learned policy which enjoys a property we call \emph{translation invariant local optimality}, which encodes the intuition that goal-reaching skills can be reused throughout the state space.
Overall, our contribution permits solving large and difficult navigation tasks, outperforming related methods from the literature.

URL: https://openreview.net/forum?id=PkHkPQMTxg

---

Title: Hierarchically branched diffusion models leverage dataset structure for class-conditional generation

Abstract: Diffusion models have attained state-of-the-art performance in generating realistic objects, including when conditioning generation on class labels. Current class-conditional diffusion models, however, implicitly model the diffusion process on all classes in a flat fashion, ignoring any known relationships between classes. Class-labeled datasets, including those common in scientific domains, are rife with internal structure. To take advantage of this structure, we propose hierarchically branched diffusion models as a novel framework for class-conditional generation. Branched diffusion models explicitly leverage the inherent relationships between distinct classes in the dataset to learn the underlying diffusion process in a hierarchical manner. We highlight several advantages of branched diffusion models over the current state-of-the-art methods for class-conditional diffusion. Firstly, they can be easily extended to novel classes in a continual-learning setting at scale. Secondly, they enable more sophisticated forms of conditional generation, such as analogy-based conditional generation (i.e. transmutation). Finally, they offer a novel interpretability into the class-conditional generation process. We extensively evaluate branched diffusion models on several benchmark and large real-world scientific datasets, spanning different data modalities (images, tabular data, and graphs). We particularly highlight the advantages of branched diffusion models on a single-cell RNA-seq dataset, where our branched model leverages the intrinsic hierarchical structure between human cell types.

URL: https://openreview.net/forum?id=sGTfxqRbei

---

Title: The Kernel Perspective on Dynamic Mode Decomposition

Abstract: The purpose of the new DMD algorithm developed in this paper is to show that DMD methods very similar to KDMD emerge naturally out of a finite rank representation of the Koopman operator. It should be noted that the developed algorithm, while derived in a different way than traditional KDMD, involves computations that are nearly identical to KDMD, and as such, is not expected to offer any performance benefits over KDMD. Moreover, the algorithmic development of the present method does not invoke feature space representations and infinite matrices as in Williams et al., rather this method uses directly the properties of Koopman (or composition) operators and kernel functions. By doing so, this makes the theoretical dependencies of kernel based DMD methods transparent as densely defined operators over infinite dimensional kernel spaces. In order to present this new kernel perspective of Koopman analysis, the manuscript first introduces reproducing kernel Hilbert spaces (RKHSs) and examines the properties of Koopman operators over said spaces. Additionally, the examination of these properties led to the proof that the Koopman operator over the Gaussian RBF's native space is only bounded when it corresponds to discrete dynamics that are affine.
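
For readers unfamiliar with kernel DMD, the computations it involves are roughly the following Gram-matrix recipe; this is a minimal, regularization-free sketch of the standard KDMD procedure, not the new derivation developed in the paper.

    import numpy as np

    def kernel_dmd_eigenvalues(X, Y, kernel):
        """X, Y: (m, n) snapshot matrices with Y[i] the successor of X[i];
        kernel(A, B) returns the Gram matrix between the rows of A and B."""
        G = kernel(X, X)                    # G_ij = k(x_i, x_j)
        A = kernel(Y, X)                    # A_ij = k(y_i, x_j)
        K_hat = np.linalg.pinv(G) @ A       # finite-rank Koopman representation
        return np.linalg.eigvals(K_hat)     # approximate Koopman eigenvalues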

URL: https://openreview.net/forum?id=sIR8xV7hGl

---

Title: Byzantine-Resilient Decentralized Multi-Armed Bandits

Abstract: In decentralized cooperative multi-armed bandits (MAB), each agent observes a distinct stream of rewards, and seeks to exchange information with others to select a sequence of arms so as to minimize its regret. Agents in the cooperative setting can outperform a single agent running a MAB method such as Upper-Confidence Bound (UCB) independently. In this work, we study how to recover such salient behavior when an unknown fraction of the agents can be \emph{Byzantine}, that is, communicate arbitrarily wrong information in the form of reward mean-estimates or confidence sets. This framework can be used to model attackers in computer networks, instigators of offensive content into recommender systems, or manipulators of financial markets. Our key contribution is the development of a fully decentralized resilient upper confidence bound (UCB) algorithm that fuses an information mixing step among agents with a truncation of inconsistent and extreme values. This truncation step enables us to establish that the performance of each normal agent is no worse than the classic single-agent UCB1 algorithm in terms of regret, and more importantly, the cumulative regret of all normal agents is strictly better than the non-cooperative case, provided that each agent has at least $3f+1$ neighbors, where $f$ is the maximum possible number of Byzantine agents in each agent's neighborhood. Extensions to time-varying neighbor graphs and minimax lower bounds on the achievable regret are further established. Experiments corroborate the merits of this framework in practice.
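
A common building block for this kind of resilience is a trimming step that discards the most extreme neighbor reports before averaging; the sketch below illustrates that idea and is not claimed to be the paper's exact information-mixing rule.

    import numpy as np

    def trimmed_fusion(own_estimate, neighbor_estimates, f):
        """Drop the f smallest and f largest neighbor estimates, then average
        the remainder together with the agent's own estimate."""
        vals = np.sort(np.asarray(neighbor_estimates, dtype=float))
        kept = vals[f:len(vals) - f] if len(vals) > 2 * f else np.array([])
        return float(np.mean(np.concatenate([[own_estimate], kept])))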

URL: https://openreview.net/forum?id=JoYMJJdvry

---

Title: Synthesizing Tabular Data with Latent Semantic Regularization

Abstract: Modern generative models have shown remarkable capabilities in synthesizing tabular data, yet they often fall short in preserving the semantic integrity of generated samples, which can be interpreted as a form of hallucination. To address this gap, we propose a novel framework that formulates this problem as a constrained optimization problem and provides a solution that learns the implicit semantic constraints of the data in an unsupervised manner and subsequently encourages the generative model to respect the learned semantic boundaries through regularization. Our framework includes a \textit{validator} component in the form of a latent space model that is tasked with capturing the underlying semantic structures of the training data. This generic validator can be used to regularize the \textit{synthesizer} model and steer it towards improving the semantic integrity of the synthesized data. We showcase our framework with a VAE-based validator and a GAN-based synthesizer. We propose metrics designed specifically to measure the semantic integrity of the synthesized data and demonstrate that our approach not only maintains the general quality of the generated data but also ensures higher adherence to complex, domain-specific semantic relationships within the generated datasets.

URL: https://openreview.net/forum?id=84tYPYkfOD

---

Title: Making Translators Privacy-aware on the User's Side

Abstract: We propose PRISM to enable users of machine translation systems to preserve the privacy of data on their own initiative. There is a growing demand to apply machine translation systems to data that require privacy protection. While several machine translation engines claim to prioritize privacy, the extent and specifics of such protection are largely ambiguous. First, there is often a lack of clarity on how and to what degree the data is protected. Even if service providers believe they have sufficient safeguards in place, sophisticated adversaries might still extract sensitive information. Second, vulnerabilities may exist outside of these protective measures, such as within communication channels, potentially leading to data leakage. As a result, users are hesitant to utilize machine translation engines for data demanding high levels of privacy protection, thereby missing out on their benefits. PRISM resolves this problem. Instead of relying on the translation service to keep data safe, PRISM provides the means to protect data on the user's side. This approach ensures that even machine translation engines with inadequate privacy measures can be used securely. For platforms already equipped with privacy safeguards, PRISM acts as an additional protection layer, further reinforcing their security. PRISM adds these privacy features without significantly compromising translation accuracy. We prove that PRISM enjoys the theoretical guarantee of word-level differential privacy. Our experiments demonstrate the effectiveness of PRISM using real-world translators, T5 and ChatGPT (GPT-3.5-turbo), and datasets in two languages. PRISM balances privacy protection with translation accuracy more effectively than other user-side privacy protection protocols and helps users grasp the content written in a foreign language without leaking the original content.

URL: https://openreview.net/forum?id=A6eqDMttcs

---

Title: SQL-PaLM: Improved large language model adaptation for Text-to-SQL

Abstract: Text-to-SQL, the process of translating natural language into Structured Query Language (SQL), represents a transformative application of large language models (LLMs), potentially revolutionizing how humans interact with data. This paper introduces the SQL-PaLM framework, a comprehensive solution for understanding and enhancing Text-to-SQL using LLMs, in the learning regimes of few-shot prompting and instruction fine-tuning. With few-shot prompting, we explore the effectiveness of consistency decoding with execution-based error filtering. With instruction fine-tuning, we delve deeply into understanding the critical paradigms that influence the performance of tuned LLMs. In particular, we investigate how performance can be improved through expanded training data coverage and diversity, synthetic data augmentation, and integrating query-specific database content. We propose a test-time selection method to further refine accuracy by integrating SQL outputs from multiple paradigms with execution feedback as guidance. Additionally, we tackle the practical challenge of navigating intricate databases with a significant number of tables and columns, proposing efficient techniques for accurately selecting relevant database elements to enhance Text-to-SQL performance. Our holistic approach yields substantial advancements in Text-to-SQL, as demonstrated on two key public benchmarks, Spider and BIRD. Through comprehensive ablations and error analyses, we shed light on the strengths and weaknesses of our framework, offering valuable insights into future work on Text-to-SQL.
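
The consistency-decoding-with-execution-filtering idea can be sketched as follows; generate_sql and execute are assumed interfaces standing in for the LLM sampler and the database engine, and voting on execution results is one simple aggregation choice.

    from collections import Counter

    def consistency_decode(generate_sql, execute, question, n_samples=8):
        """Sample several SQL candidates, drop those that fail to execute, and
        return the candidate whose execution result is most common."""
        results = {}
        for _ in range(n_samples):
            sql = generate_sql(question)        # one sampled decoding
            try:
                results[sql] = execute(sql)     # execution-based error filtering
            except Exception:
                continue
        if not results:
            return None
        top = Counter(map(repr, results.values())).most_common(1)[0][0]
        return next(s for s, res in results.items() if repr(res) == top)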

URL: https://openreview.net/forum?id=rlloVZoKrX

---

Title: Simple Imputation Rules for Prediction with Missing Data: Theoretical Guarantees vs. Empirical Performance

Abstract: Missing data is a common issue in real-world datasets. This paper studies the performance of impute-then-regress pipelines by contrasting theoretical and empirical evidence. We establish the asymptotic consistency of such pipelines for a broad family of imputation methods. While common sense suggests that a 'good' imputation method produces datasets that are plausible, we show, on the contrary, that, as far as prediction is concerned, crude can be good. Among others, we find that mode-impute is asymptotically sub-optimal, while mean-impute is asymptotically optimal. We then exhaustively assess the validity of these theoretical conclusions on a large corpus of synthetic, semi-real, and real datasets. While the empirical evidence we collect mostly supports our theoretical findings, it also highlights gaps between theory and practice and opportunities for future research, regarding the relevance of the MAR assumption, the complex interdependency between the imputation and regression tasks, and the need for realistic synthetic data generation models.
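
The mean-impute-then-regress pipeline whose asymptotic optimality is discussed above can be assembled in a few lines; the downstream estimator (ridge regression) below is an arbitrary illustration.

    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import Ridge
    from sklearn.pipeline import make_pipeline

    # Impute each missing entry with the column mean learned on the training
    # split, then fit a regressor on the completed features.
    model = make_pipeline(SimpleImputer(strategy="mean"), Ridge())
    # model.fit(X_train, y_train); model.predict(X_test)  # X_* may contain NaNs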

URL: https://openreview.net/forum?id=IKH5ziX9dk

---

Title: Multiple Kronecker RLS fusion-based link propagation for drug-side effect prediction

Abstract: Drug-side effect prediction has become an essential area of research in the field of pharmacology. As the use of medications continues to rise, so does the importance of understanding and mitigating the potential risks associated with them. At present, researchers have turned to data-driven methods to predict drug-side effects. Drug-side effect prediction is a link prediction problem, and the related data can be described from various perspectives. To process these kinds of data, a multi-view method, called Multiple Kronecker RLS fusion-based link propagation (MKronRLSF-LP), is proposed. MKronRLSF-LP extends Kron-RLS by finding the consensus partitions and multiple graph Laplacian constraints in the multi-view setting. Both of these multi-view mechanisms contribute to a higher-quality result. Extensive experiments have been conducted on drug-side effect datasets, and our empirical results provide evidence that our approach is effective and robust.

URL: https://openreview.net/forum?id=LCPzaR9mML

---

Title: Meta Learning for Support Recovery of High-Dimensional Ising Models

Abstract: In this paper, we consider the meta learning problem for estimating the graphs associated with high-dimensional Ising models, using the method of $\ell_1$-regularized logistic regression for neighborhood selection of each node. Our goal is to use the information learned from the auxiliary tasks in the learning of the novel task to reduce its sufficient sample complexity. To this end, we propose a novel generative model as well as an improper estimation method. In our setting, all the tasks are similar in their random model parameters and supports. By pooling all the samples from the auxiliary tasks to improperly estimate a single parameter vector, we can recover the true support union, assumed small in size, with high probability with a sufficient sample complexity of $n = O(d^3 \log p/K)$ per task, for $K$ tasks of Ising models with $p$ nodes and a maximum neighborhood size $d$. This is very relevant for meta learning where there are many tasks $K = O(d^3 \log p)$, each with very few samples, i.e., $n = O(1)$, in a scenario where multi-task learning fails. We prove a matching information-theoretic lower bound for the necessary number of samples per task, which is $n = \Omega(d^3 \log p/K)$, and thus, our algorithm is minimax optimal. Finally, with the support for the novel task restricted to the estimated support union, we prove that consistent neighborhood selection for the novel task can be obtained with a sufficient sample complexity of $O(d^3 \log d)$. This reduces the original sample complexity of $n = O(d^3 \log p)$ for learning a single task. We also prove a matching information-theoretic lower bound of $\Omega(d^3 \log d)$ for the necessary number of samples.
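
The per-node neighborhood selection step mentioned above can be illustrated with an l1-regularized logistic regression of one node's spins on all others; the regularization strength and threshold below are illustrative, and the cross-task pooling that drives the meta-learning result is not shown.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def neighborhood_selection(samples, node, C=0.1, tol=1e-6):
        """samples: (n, p) array of +/-1 spins.  Returns the estimated
        neighborhood of `node` via l1-regularized logistic regression."""
        y = (samples[:, node] > 0).astype(int)
        X = np.delete(samples, node, axis=1)
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(X, y)
        others = [j for j in range(samples.shape[1]) if j != node]
        return [others[j] for j in np.flatnonzero(np.abs(clf.coef_.ravel()) > tol)]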

URL: https://openreview.net/forum?id=n40EWwis1j

---

Title: SEAL: Simultaneous Label Hierarchy Exploration And Learning

Abstract: Label hierarchy is an important source of external knowledge that can enhance classification performance. However, most existing methods rely on predefined label hierarchies that may not match the data distribution. To address this issue, we propose Simultaneous label hierarchy Exploration And Learning (SEAL), a new framework that explores the label hierarchy by augmenting the observed labels with latent labels that follow a prior hierarchical structure. Our approach uses a 1-Wasserstein metric over the tree metric space as an objective function, which enables us to simultaneously learn a data-driven label hierarchy and perform (semi-)supervised learning. We evaluate our method on several standard benchmarks and show that it achieves improved results in semi-supervised image classification scenarios.

URL: https://openreview.net/forum?id=JZVqDTNA59

---

Title: State-Separated SARSA: A Practical Sequential Decision-Making Algorithm with Recovering Rewards

Abstract: While many multi-armed bandit algorithms assume that rewards for all arms are constant across rounds, this assumption does not hold in many real-world scenarios. This paper considers the setting of recovering bandits (Pike-Burke & Grunewalder, 2019), where the reward depends on the number of rounds elapsed since the last time an arm was pulled. We propose a new reinforcement learning (RL) algorithm tailored to this setting, named the State-Separated SARSA (SS-SARSA) algorithm, which treats rounds as states. The SS-SARSA algorithm achieves efficient learning by reducing the number of state combinations required for Q-learning/SARSA, which often suffer from combinatorial issues in large-scale RL problems. Additionally, it makes minimal assumptions about the reward structure and offers lower computational complexity. Furthermore, we prove asymptotic convergence to an optimal policy under mild assumptions. Simulation studies demonstrate the superior performance of our algorithm across various settings.

URL: https://openreview.net/forum?id=zw7JGv25Hb

---

Title: Discriminative reconstruction via simultaneous dense and sparse coding

Abstract: Discriminative features extracted from the sparse coding model have been shown to perform well for classification. Recent deep learning architectures have further improved reconstruction in inverse problems by considering new dense priors learned from data. We propose a novel dense and sparse coding model that integrates both representation capability and discriminative features. The model studies the problem of recovering a dense vector x and a sparse vector u given measurements of the form y = Ax+Bu. Our first analysis proposes a geometric condition based on the minimal angle between spanning subspaces corresponding to the matrices A and B that guarantees unique solution to the model. The second analysis shows that, under mild assumptions, a convex program recovers the dense and sparse components. We validate the effectiveness of the model on simulated data and propose a dense and sparse autoencoder (DenSaE) tailored to learning the dictionaries from the dense and sparse model. We demonstrate that (i) DenSaE denoises natural images better than architectures derived from the sparse coding model (Bu), (ii) in the presence of noise, training the biases in the latter amounts to implicitly learning the Ax + Bu model, (iii) A and B capture low- and high-frequency contents, respectively, and (iv) compared to the sparse coding model, DenSaE offers a balance between discriminative power and representation.
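
One convex formulation consistent with the recovery problem above is least squares with an l1 penalty on the sparse block only; the proximal-gradient sketch below is an illustration under that assumption, not necessarily the exact program analyzed in the paper.

    import numpy as np

    def dense_sparse_recover(y, A, B, lam=0.1, iters=500):
        """Minimize 0.5*||y - A x - B u||^2 + lam*||u||_1 by proximal gradient,
        soft-thresholding only the sparse component u."""
        step = 1.0 / np.linalg.norm(np.hstack([A, B]), 2) ** 2  # 1 / Lipschitz constant
        x, u = np.zeros(A.shape[1]), np.zeros(B.shape[1])
        for _ in range(iters):
            r = A @ x + B @ u - y
            x -= step * (A.T @ r)
            u -= step * (B.T @ r)
            u = np.sign(u) * np.maximum(np.abs(u) - step * lam, 0.0)
        return x, u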

URL: https://openreview.net/forum?id=FkgM06HEbk

---

Title: Jigsaw Game: Federated Clustering

Abstract: Federated learning has recently garnered significant attention, especially within the domain of supervised learning. However, despite the abundance of unlabeled data on end-users, unsupervised learning problems such as clustering in the federated setting remain underexplored. In this paper, we investigate the federated clustering problem, with a focus on federated k-means. We outline the challenge posed by its non-convex objective and data heterogeneity in the federated framework. To tackle these challenges, we adopt a new perspective by studying the structures of local solutions in k-means and propose a one-shot algorithm called FeCA (Federated Centroid Aggregation). FeCA adaptively refines local solutions on clients, then aggregates these refined client solutions to recover the global solution of the entire dataset in a single round. We empirically demonstrate the robustness of FeCA under various federated scenarios on both synthetic and real-world data. Additionally, we extend FeCA to representation learning and present DeepFeCA, which combines DeepCluster and FeCA for unsupervised feature learning in the federated setting.
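
A generic one-shot aggregation step for federated k-means (pool the clients' local centroids, weighted by cluster sizes, and re-cluster them at the server) looks as follows; FeCA's adaptive refinement of local solutions is omitted, so this is only an illustrative baseline.

    import numpy as np
    from sklearn.cluster import KMeans

    def one_shot_server_aggregation(local_centroids, local_counts, k):
        """local_centroids: list of (k_i, d) arrays; local_counts: list of (k_i,)
        cluster sizes.  Returns k global centroids."""
        C = np.vstack(local_centroids)
        w = np.concatenate(local_counts).astype(float)
        return KMeans(n_clusters=k, n_init=10).fit(C, sample_weight=w).cluster_centers_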

URL: https://openreview.net/forum?id=8YcUJbxmmC

---

Title: Fast, accurate and lightweight sequential simulation-based inference using Gaussian locally linear mappings

Abstract: Bayesian inference for complex models with an intractable likelihood can be tackled using algorithms performing many calls to computer simulators. These approaches are collectively known as "simulation-based inference" (SBI). Recent SBI methods have made use of neural networks (NN) to provide approximate, yet expressive constructs for the unavailable likelihood function and the posterior distribution. However, they do not generally achieve an optimal trade-off between accuracy and computational demand.
In this work, we propose an alternative that provides both approximations to the likelihood and the posterior distribution, using structured mixtures of probability distributions. Our approach produces accurate posterior inference when compared to state-of-the-art NN-based SBI methods, while exhibiting a much smaller computational footprint. We illustrate our results on several benchmark models from the SBI literature.

URL: https://openreview.net/forum?id=Q0nzpRcwWn

---

Title: Convergences for Minimax Optimization Problems over Infinite-Dimensional Spaces Towards Stability in Adversarial Training

Abstract: Training neural networks that require adversarial optimization, such as generative adversarial networks (GANs) and unsupervised domain adaptations (UDAs), suffers from instability. This instability problem comes from the difficulty of the minimax optimization, and there have been various approaches in GANs and UDAs to overcome this problem. In this study, we tackle this problem theoretically through a functional analysis. Specifically, we show the convergence property of the minimax problem by the gradient descent over the infinite-dimensional spaces of continuous functions and probability measures under certain conditions.
Using this setting, we can discuss GANs and UDAs comprehensively, which have been studied independently.
In addition, we show that the conditions necessary for the convergence property are interpreted as stabilization techniques of adversarial training such as the spectral normalization and the gradient penalty.

URL: https://openreview.net/forum?id=6LePXHr2f3

---

Title: On Using Large Language Models to Generate Plans

Abstract: Automated planning is concerned with developing efficient algorithms to generate plans or sequences of actions to achieve a specific goal in a given environment. Emerging Large Language Models (LLMs) can answer questions, write high-quality programming code, and predict protein folding, showcasing their versatility in solving various tasks beyond language-based problems. This paper explores if and how LLMs can also be used for automated planning given the diverse ways LLMs are modeled and trained. To do so, we seek to answer four key questions. Firstly, we want to understand the effectiveness of different LLM architectures for plan generation. Secondly, we aim to identify which pre-training data (general purpose vs code specific) effectively facilitates plan generation. Thirdly, we investigate whether fine-tuning or prompting is a more effective approach for plan generation. Finally, we explore whether LLMs are capable of plan generalization. By answering these questions, the study seeks to shed light on the capabilities of LLMs in solving complex planning problems and provide insights into the most effective approaches for using LLMs in this context.

URL: https://openreview.net/forum?id=JRoG71PAB0

---

Title: Multi-Fidelity Active Learning with GFlowNets

Abstract: In recent decades, the capacity to generate large amounts of data in science and engineering applications has been growing steadily. Meanwhile, the progress in machine learning has turned it into a suitable tool to process and utilise the available data. Nonetheless, many relevant scientific and engineering problems present challenges where current machine learning methods cannot yet efficiently leverage the available data and resources. For example, in scientific discovery, we are often faced with the problem of exploring very large, structured and high-dimensional spaces, where querying a high-fidelity, black-box objective function is very expensive. Progress in machine learning methods that can efficiently tackle such problems would help accelerate currently crucial areas such as drug and materials discovery. In this paper, we propose a multi-fidelity active learning algorithm with GFlowNets as a sampler, to efficiently discover diverse, high-scoring candidates where multiple approximations of the black-box function are available at lower fidelity and cost. Our evaluation on molecular discovery tasks shows that multi-fidelity active learning with GFlowNets can discover high-scoring candidates at a fraction of the budget of its single-fidelity counterpart while maintaining diversity, unlike RL-based alternatives. These results open new avenues for multi-fidelity active learning to accelerate scientific discovery and engineering design.

URL: https://openreview.net/forum?id=dLaazW9zuF

---

Title: Closing the gap between SVRG and TD-SVRG with Gradient Splitting

Abstract: Temporal difference (TD) learning is a policy evaluation method in reinforcement learning whose performance can be enhanced by variance reduction.
Recently, multiple works have sought to fuse TD learning with Stochastic Variance Reduced Gradient (SVRG) method to achieve a geometric rate of convergence.
However, the resulting convergence rate is significantly weaker than what is achieved by SVRG in the setting of convex optimization.
In this work we utilize a recent interpretation of TD-learning as the splitting of the gradient of an appropriately chosen function, thus simplifying the algorithm and fusing TD with SVRG. Our main result is a geometric convergence bound with predetermined learning rate of $1/8$, which is identical to the convergence bound available for SVRG in the convex setting. Our theoretical findings are supported by a set of experiments.
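
To make the fusion of TD with SVRG concrete, one SVRG-style epoch for linear TD(0) can be sketched as below; the 1/8 step size echoes the predetermined learning rate mentioned above, while the anchor-point schedule and sampling details are simplifications.

    import numpy as np

    def td_svrg_epoch(theta, transitions, features, gamma, lr=0.125):
        """transitions: list of (s, r, s_next) index/reward tuples; features[s]
        is the feature vector of state s."""
        def td_dir(th, s, r, s_next):               # per-sample TD update direction
            delta = r + gamma * features[s_next] @ th - features[s] @ th
            return delta * features[s]
        ref = theta.copy()
        full = np.mean([td_dir(ref, *t) for t in transitions], axis=0)
        for s, r, s_next in transitions:            # variance-reduced updates
            g = td_dir(theta, s, r, s_next) - td_dir(ref, s, r, s_next) + full
            theta = theta + lr * g
        return theta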

URL: https://openreview.net/forum?id=dixU4fozPQ

---

Title: TransformMix: Learning Transformation and Mixing Strategies from Data

Abstract: Data augmentation improves the generalization power of deep learning models by synthesizing more training samples. Sample-mixing is a popular data augmentation approach that creates additional data by combining existing samples. Recent sample-mixing methods, like Mixup and Cutmix, adopt simple mixing operations to blend multiple inputs. Although such a heuristic approach shows certain performance gains in some computer vision tasks, it mixes the images blindly and does not adapt to different datasets automatically. A mixing strategy that is effective for a particular dataset often does not generalize well to other datasets. If not properly configured, the methods may create misleading mixed images, which jeopardize the effectiveness of sample-mixing augmentations. In this work, we propose an automated approach, TransformMix, to learn better transformation and mixing augmentation strategies from data. In particular, TransformMix applies learned transformations and mixing masks to create compelling mixed images that contain correct and important information for the target tasks. We demonstrate the effectiveness of TransformMix on multiple datasets in transfer learning, classification, object detection, and knowledge distillation settings. Experimental results show that our method achieves better performance as well as efficiency when compared with strong sample-mixing baselines.
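For readers unfamiliar with sample-mixing, a minimal sketch of the fixed Mixup blend that methods such as TransformMix generalize; the learned transformations and per-pixel masks of the paper are not shown, and the Beta parameter is an illustrative choice.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Classic Mixup: blend two samples and their one-hot labels with a
    Beta-distributed coefficient. TransformMix replaces this fixed blend
    with learned transformations and mixing masks."""
    rng = rng or np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

# Toy usage with two 4x4 "images" and one-hot labels:
x1, x2 = np.zeros((4, 4)), np.ones((4, 4))
y1, y2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
x_mix, y_mix = mixup(x1, y1, x2, y2)
```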

URL: https://openreview.net/forum?id=64GhojjTiZ

---

Title: Learning Tree-Structured Composition of Data Augmentation

Abstract: Data augmentation is widely used in settings where one needs to learn a neural network given very little labeled data. Typically, a composition of several transformations is applied sequentially to transform a given sample. Existing approaches for finding a composition either rely on domain expertise, or involve solving a complex optimization problem. The key challenge is that for finding a composition of length $d$, the search space is $k^d$, given a list of $k$ transformation functions.

In this paper, we focus on designing efficient algorithms whose running time is much lower. We propose a top-down recursive algorithm to search inside the space of tree-structured composition (of the $k$ transformations), where each tree node corresponds to one transformation. The tree structure can be viewed as a generalization of existing augmentation methods, such as the one constructed by SimCLR (Chen et al., 2020). Our algorithm runs in time $O(2^d k)$, which is much faster than the worst-case complexity of $O(k^d)$ (as soon as $k$ grows away from 2). We extend the algorithm to tackle data distributions with heterogeneous subpopulations by finding one tree for each subpopulation and then learning a weighted combination of the trees.

We validate the proposed algorithms on several graph and image data sets, including a multi-label graph classification data set we collected. The dataset exhibits significant variations in the sizes of graphs and their average degrees, making it ideal for studying data augmentation. On the graph classification data set, our proposed algorithms can reduce computation by 43% over several recent augmentation search methods while improving performance by 4.3%. Besides, extensive experiments in contrastive learning also validate the benefit of our algorithm. The tree structures allow one to interpret the relative role of each augmentation, for example, identifying the important transformations on small vs. large graphs.
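To make the search problem concrete, here is a simplified greedy depth-wise composition search; it is an illustrative stand-in and not the paper's tree-recursive $O(2^d k)$ algorithm, and the transformations and score function are hypothetical.

```python
import numpy as np

def greedy_compose(transforms, depth, score_fn):
    """Greedy depth-wise search for a composition of augmentations.
    `transforms` is a list of callables and `score_fn(composition)` returns a
    validation score for training with that composition. This simplified
    stand-in evaluates k candidates per level instead of searching trees."""
    composition = []
    for _ in range(depth):
        best_t, best_score = None, -np.inf
        for t in transforms:                     # k candidates per level
            s = score_fn(composition + [t])
            if s > best_score:
                best_t, best_score = t, s
        composition.append(best_t)
    return composition

# Toy usage with array transforms and a placeholder score function:
transforms = [np.flipud, np.fliplr, lambda x: np.rot90(x)]
comp = greedy_compose(transforms, depth=2, score_fn=lambda c: float(len(c)))
```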

URL: https://openreview.net/forum?id=lmgf03HeqV

---

Title: Predicting the Encoding Error of Implicit Neural Representations

Abstract: Implicit Neural Representations (INRs), which encode signals such as images, videos, and 3D shapes in the weights of neural networks, are becoming increasingly popular. Among their many applications is signal compression, for which there is great interest in achieving the highest possible fidelity to the original signal subject to constraints such as neural network size, training (encoding) and inference (decoding) time. But training INRs can be a computationally expensive process, making it challenging to determine the best possible tradeoff under such constraints. Towards this goal, we propose a novel problem: predicting the encoding error (i.e. training loss) that an INR will reach on a given training signal. We present a method which predicts the encoding error that a popular INR network (SIREN) will reach, given its network hyperparameters and the signal to encode. This method is trained on a unique dataset of 300,000 SIRENs, trained across a variety of images and hyperparameters. Our predictive method demonstrates the feasibility of this regression problem, and allows users to anticipate the encoding error that a SIREN network will reach in milliseconds instead of minutes or longer. We also provide insights into the behavior of SIREN networks, such as why narrow SIRENs can have very high random variation in encoding error, and how the performance of SIRENs relates to JPEG compression.
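For context, a minimal NumPy sketch of the SIREN architecture whose encoding error the paper predicts: a coordinate MLP with sine activations and the standard SIREN initialization. The frequency $w_0$, layer widths, and the toy target are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def siren_init(sizes, w0=30.0, rng=None):
    """Initialize a SIREN-style MLP (sine activations) with the usual
    uniform bounds: 1/fan_in for the first layer, sqrt(6/fan_in)/w0 after."""
    rng = rng or np.random.default_rng(0)
    params = []
    for i, (fan_in, fan_out) in enumerate(zip(sizes[:-1], sizes[1:])):
        bound = 1.0 / fan_in if i == 0 else np.sqrt(6.0 / fan_in) / w0
        params.append((rng.uniform(-bound, bound, (fan_in, fan_out)),
                       np.zeros(fan_out)))
    return params

def siren_forward(params, coords, w0=30.0):
    """Map pixel coordinates in [-1, 1]^2 to predicted intensities."""
    h = coords
    for W, b in params[:-1]:
        h = np.sin(w0 * (h @ W + b))
    W, b = params[-1]
    return h @ W + b  # linear output layer

# Encoding error of an (untrained) SIREN on a toy 8x8 "image":
xs = np.linspace(-1, 1, 8)
coords = np.stack(np.meshgrid(xs, xs), -1).reshape(-1, 2)
target = np.random.default_rng(1).random((64, 1))
pred = siren_forward(siren_init([2, 64, 64, 1]), coords)
print(float(np.mean((pred - target) ** 2)))  # the quantity the paper predicts
```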

URL: https://openreview.net/forum?id=iKPC7N85Pf

---

Title: Promoting Exploration in Memory-Augmented Adam using Critical Momenta

Abstract: Adaptive gradient-based optimizers, notably Adam, have left their mark in training large-scale deep learning models, offering fast convergence and robustness to hyperparameter settings. However, they often struggle with generalization, attributed to their tendency to converge to sharp minima in the loss landscape. To address this, we propose a new memory-augmented version of Adam that encourages exploration towards flatter minima by incorporating a buffer of critical momentum terms during training. This buffer prompts the optimizer to overshoot beyond narrow minima, promoting exploration. Through comprehensive analysis in simple settings, we illustrate the efficacy of our approach in increasing exploration and bias towards flatter minima. We empirically demonstrate that it can improve model performance for image classification on ImageNet and CIFAR10/100, language modelling on Penn Treebank, and online learning tasks on TinyImageNet and 5-dataset.
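To convey the flavour of a momentum buffer, below is a simplified Adam variant that averages a small buffer of recent first-moment vectors into each update; it is an illustrative stand-in, not the paper's exact rule for selecting "critical" momenta, and all constants are assumptions.

```python
import numpy as np

def adam_with_momentum_buffer(grad_fn, theta0, steps=500, lr=5e-2,
                              betas=(0.9, 0.999), eps=1e-8, buffer_size=5):
    """Adam-like update that blends a buffer of recent momentum vectors,
    which tends to push the iterate past narrow valleys."""
    theta = np.array(theta0, dtype=float)
    m, v = np.zeros_like(theta), np.zeros_like(theta)
    buffer = []
    b1, b2 = betas
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        buffer = (buffer + [m.copy()])[-buffer_size:]   # keep recent momenta
        m_hat = np.mean(buffer, axis=0) / (1 - b1 ** t)  # blended, bias-corrected
        v_hat = v / (1 - b2 ** t)
        theta -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Toy run on f(x) = ||x||^2 / 2, whose gradient at theta is theta itself:
print(adam_with_momentum_buffer(lambda x: x, np.array([3.0, -2.0])))
```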

URL: https://openreview.net/forum?id=sHSkJqyQgW

---

Title: IRWE: Inductive Random Walk for Joint Inference of Identity and Position Network Embedding

Abstract: Network embedding, which maps graphs to distributed representations, is a unified framework for various graph inference tasks. According to the topology properties (e.g., structural roles and community memberships of nodes) to be preserved, it can be categorized into identity and position embedding. However, existing methods can only capture one type of property. Some approaches can support inductive inference, which generalizes the embedding model to new nodes or graphs, but they rely on the availability of attributes. Due to the complicated correlations between topology and attributes, it is unclear for some inductive methods which type of property they can capture. In this study, we explore a unified framework for the joint inductive inference of identity and position embeddings without attributes. An inductive random walk embedding (IRWE) method is proposed, which combines multiple attention units to handle the random walk on graph topology and simultaneously derives identity and position embeddings that are jointly optimized. In particular, we demonstrate that some random walk statistics can be informative features to characterize node identities and positions while supporting the inductive embedding inference. Experiments validate the superior performance of IRWE beyond various baselines for the transductive and inductive inference of identity and position embeddings.

URL: https://openreview.net/forum?id=bDse8Z2gff

---

Title: A Practical Guide to Statistical Distances for Evaluating Generative Models in Science

Abstract: Generative models are invaluable in many fields of science because of their ability to capture high-dimensional and complicated distributions, such as photo-realistic images, protein structures, and connectomes. How do we evaluate the samples these models generate? This work aims to provide an accessible entry point to understanding popular notions of statistical distances, requiring only foundational knowledge in mathematics and statistics. We focus on four commonly used notions of statistical distances representing different methodologies: Using low-dimensional projections (Sliced-Wasserstein; SW), obtaining a distance using classifiers (Classifier Two-Sample Tests; C2ST), using embeddings through kernels (Maximum Mean Discrepancy; MMD), or neural networks (Fréchet Inception Distance; FID). We highlight the intuition behind each distance and explain their merits, scalability, complexity, and pitfalls. To demonstrate how these distances are used in practice, we evaluate generative models from different scientific domains, namely a model of decision making and a model generating medical images. We showcase that distinct distances can give different results on similar data. Through this guide, we aim to help researchers to use, interpret, and evaluate statistical distances for generative models in science.
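As a concrete example of one of the four distances discussed, here is a minimal sketch of a squared-MMD estimate with an RBF kernel; the bandwidth and the toy data are arbitrary illustrative choices.

```python
import numpy as np

def mmd_rbf(x, y, sigma=1.0):
    """Unbiased estimate of squared MMD with an RBF kernel between two
    sample sets x (n, d) and y (m, d)."""
    def k(a, b):
        d2 = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
        return np.exp(-d2 / (2 * sigma**2))
    kxx, kyy, kxy = k(x, x), k(y, y), k(x, y)
    n, m = len(x), len(y)
    return ((kxx.sum() - np.trace(kxx)) / (n * (n - 1))
            + (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
            - 2 * kxy.mean())

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(500, 8))
fake = rng.normal(0.3, 1.0, size=(500, 8))   # a slightly shifted "model"
print(mmd_rbf(real, fake))                   # larger values = more dissimilar
```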

URL: https://openreview.net/forum?id=isEFziui9p

---

Title: Understanding the Role of Invariance in Transfer Learning

Abstract: Transfer learning is a powerful technique for knowledge-sharing between different tasks. Recent work has found that the representations of models with certain invariances, such as to adversarial input perturbations, achieve higher performance on downstream tasks. These findings suggest that invariance may be an important property in the context of transfer learning. However, the relationship of invariance with transfer performance is not fully understood yet and a number of questions remain. For instance, how important is invariance compared to other factors of the pretraining task? How transferable is learned invariance? In this work, we systematically investigate the importance of representational invariance for transfer learning, as well as how it interacts with other parameters during pretraining. To do so, we introduce a family of synthetic datasets that allow us to precisely control factors of variation both in training and test data. Using these datasets, we a) show that for learning representations with high transfer performance, invariance to the right transformations is as, or often more, important than most other factors such as the number of training samples, the model architecture and the identity of the pretraining classes, b) show conditions under which invariance can harm the ability to transfer representations and c) explore how transferable invariance is between tasks.

URL: https://openreview.net/forum?id=spJI4LSPIU

---

Title: Differentially Private Iterative Screening Rules for Linear Regression

Abstract: Linear $L_1$-regularized models have remained one of the simplest and most effective tools in data science. Over the past decade, screening rules have risen in popularity as a way to reduce the runtime for producing the sparse regression weights of $L_1$ models. However, despite the increasing need for privacy-preserving models for data analysis, to the best of our knowledge, no differentially private screening rule exists. In this paper, we develop the first private screening rule for linear regression. We initially find that this screening rule is too strong: it screens too many coefficients as a result of the private screening step. However, a weakened implementation of private screening reduces overscreening and improves performance.
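To make the idea of a private screening step concrete, here is an illustrative sketch (not the paper's rule) that perturbs a basic correlation-based screening test with Laplace noise; the sensitivity bound and threshold are assumptions.

```python
import numpy as np

def noisy_screen(X, y, lam, epsilon, sensitivity, rng=None):
    """Illustrative private screening: add Laplace noise to each feature's
    correlation with the response and discard features whose noisy
    correlation falls below the L1 threshold. `sensitivity` is assumed to
    bound how much one record can change a correlation."""
    rng = rng or np.random.default_rng(0)
    corr = np.abs(X.T @ y) / len(y)
    noisy = corr + rng.laplace(0.0, sensitivity / epsilon, size=corr.shape)
    return noisy >= lam          # True = keep the coefficient, False = screen out

# Toy usage: 200 samples, 10 features, only the first two are informative.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=200)
print(noisy_screen(X, y, lam=0.2, epsilon=1.0, sensitivity=0.02))
```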

URL: https://openreview.net/forum?id=yY3viAfOPk

---

Title: Inference- and Optimization-based Approximated Solver for Dynamic Job-shop Scheduling Problem

Abstract: The Job-shop Scheduling Problem (JSP) is a well-known combinatorial optimization problem that arranges tasks for efficient processing. It is used in a broad range of industrial applications, such as smart manufacturing and transportation.
We focus on updating a schedule in a situation where the set of jobs varies, and we propose an inference-based model called JSPformer within a data-driven scheme.
JSPformer permits a solution inference with a variable set of jobs by encoding input data into a set of job-wise feature vectors and by using a neural network for set-structured data.
Furthermore, for cases where a few minutes of computation is possible, we propose JSPformer+Opt, a hybrid model of JSPformer and a local optimization.
The local optimization is intended to quickly produce a more efficient schedule from an inference solution. It keeps part of the inferred solution and optimizes the rest, improving solution quality while reducing the problem size for fast computation.
In numerical experiments, JSPformer+Opt produced, within a minute, solutions to dynamic JSP instances that were better than or competitive with solutions obtained by an exact solver running for over 30 minutes.

URL: https://openreview.net/forum?id=7ktz8aPULO

---

Title: How to choose the right transfer learning protocol? A qualitative analysis in a controlled set-up

Abstract: Transfer learning is a powerful technique that enables model training with limited amounts of data, making it crucial in many data-scarce real-world applications. Typically, transfer learning protocols require first to transfer all the feature-extractor layers of a network pre-trained on a data-rich source task, and then to adapt only the task-specific readout layers to a data-poor target task. This workflow is based on two main assumptions: first, the feature maps of the pre-trained model are qualitatively similar to the ones that would have been learned with enough data on the target task; second, the source representations of the last hidden layers are always the most expressive. In this work, we demonstrate that this is not always the case and that the largest performance gain may be achieved when smaller portions of the pre-trained network are transferred. In particular, we perform a set of numerical experiments in a controlled setting, showing how the optimal transfer depth depends non-trivially on the amount of available training data and on the degree of source-target task similarity, and it is often convenient to transfer only the first layers. We then propose a strategy to detect the most promising source task among the available candidates. This approach compares the internal representations of a network trained entirely from scratch on the target task with those of the networks pre-trained on the potential source tasks.
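The source-task selection strategy relies on comparing internal representations of networks. A minimal sketch using linear Centered Kernel Alignment (CKA), a common representation-similarity measure assumed here for illustration (the paper's exact comparison metric may differ):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two activation matrices of
    shape (n_samples, n_features); 1 = identical representations (up to
    rotation/scale), 0 = unrelated."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

# Compare a candidate source model's layer activations against those of a
# small scratch-trained target model on the same probe inputs (toy arrays):
rng = np.random.default_rng(0)
acts_scratch = rng.normal(size=(256, 64))
acts_source = acts_scratch @ rng.normal(size=(64, 64))   # related representation
print(linear_cka(acts_scratch, acts_source))
```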

URL: https://openreview.net/forum?id=XWQgXLYwv2

---

Title: Graph Neural Modeling of Network Flows

Abstract: Network flow problems, which involve distributing traffic such that the underlying infrastructure is used effectively, are ubiquitous in transportation and logistics. Among them, the general Multi-Commodity Network Flow (MCNF) problem concerns the distribution of multiple flows of different sizes between several sources and sinks, while achieving effective utilization of the links. Due to the appeal of data-driven optimization, these problems have increasingly been approached using graph learning methods. In this paper, we propose a novel graph learning architecture for network flow problems called Per-Edge Weights (PEW). This method builds on a Graph Attention Network and uses distinctly parametrized message functions along each link. We extensively evaluate the proposed solution through an Internet flow routing case study using $21$ Service Provider topologies and $2$ routing schemes. We show that, with both synthetic and real-world traffic, PEW yields substantial gains over architectures whose global message function constrains the routing unnecessarily. We also find that an MLP is competitive with other standard architectures. Furthermore, we analyze the relationship between graph structure and predictive performance for data-driven routing of flows, an aspect that has not been considered by existing work in the area.

URL: https://openreview.net/forum?id=Ig7vEwO5G2

---

Title: A General-Purpose Multi-Modal OOD Detection Framework

Abstract: Out-of-distribution (OOD) detection seeks to identify test samples that deviate from the training data, which is critical to ensuring the safety and reliability of machine learning (ML) systems. While a plethora of methods have been developed to detect uni-modal OOD samples, only a few have focused on multi-modal OOD detection. Current contrastive learning-based methods primarily address multi-modal OOD detection in a scenario where an image is not related to the class labels in training data. However, ML systems in real-world applications may encounter a broader spectrum of anomalies caused by different factors such as systematic errors in labeling, environmental changes, and sensor malfunctions. Hence, we propose a new method that can simultaneously detect anomalies from multiple different OOD scenarios, arising from fine-grained image features and textual descriptions, instead of large categorical information. To achieve this goal, we propose a general-purpose weakly-supervised OOD detection framework, called WOOD, that combines a binary classifier and a contrastive learning module to reap the benefits of both. In order to better distinguish in-distribution (ID) samples from OOD ones, we employ the Hinge loss to constrain the similarity of their latent representations. Moreover, we devise a new scoring metric that fuses predictions from both the binary classifier and contrastive learning to enhance OOD detection. Extensive experimental results on multiple benchmarks demonstrate that the proposed WOOD significantly outperforms the state-of-the-art methods for multi-modal OOD detection. Importantly, our approach can achieve superior detection performance in a variety of OOD scenarios. The source code will be made publicly available upon publication.

URL: https://openreview.net/forum?id=nYzws7sSzo

---

Title: Reward Poisoning on Federated Reinforcement Learning

Abstract: Federated learning (FL) has become a popular tool for solving traditional Reinforcement Learning (RL) tasks. The multi-agent structure addresses the major concern of data hunger in traditional RL, while the federated mechanism protects the data privacy of individual agents. Despite the advantage FL brings to RL, Federated Reinforcement Learning (FRL) is inherently susceptible to poisoning, as both FL and RL are vulnerable to such training-time attacks; however, the vulnerability of FRL has not been well-studied before. In this work, we propose a general framework to characterize FRL poisoning as an optimization problem and design a poisoning protocol that can be applied to policy-based FRL. Our framework is versatile, catering to FRL scenarios employing both policy-gradient local RL and actor-critic local RL. In the context of actor-critic configurations, we conduct training for a pair of critics, one private and one public, aimed at maximizing the potency of poisoning. We provably show that our method can strictly hurt the global objective. We verify the effectiveness of our poisoning approach through comprehensive experiments, supported by mainstream RL algorithms, across various RL OpenAI Gym environments covering a wide range of difficulty levels. Within these experiments, we assess our proposed attack by comparing it to various baselines, including standard, poisoned, and robust FRL methods. The results demonstrate the power of the proposed protocol in effectively poisoning FRL systems: it consistently diminishes performance across diverse environments, proving to be more effective than baseline methods. Our work provides new insights into the training-time vulnerability of FL in RL and poses new challenges for designing secure FRL algorithms.

URL: https://openreview.net/forum?id=h2jpFufyG4

---

Title: A Short Survey on Importance Weighting for Machine Learning

Abstract: Importance weighting is a fundamental procedure in statistics and machine learning that weights the objective function or probability distribution based on the importance of the instance in some sense. The simplicity and usefulness of the idea have led to many applications of importance weighting. For example, it is known that supervised learning under an assumption about the difference between the training and test distributions, called distribution shift, can guarantee statistically desirable properties through importance weighting by their density ratio. This survey summarizes the broad applications of importance weighting in machine learning and related research.

URL: https://openreview.net/forum?id=IhXM3g2gxg

---

Title: The Impact of Syntactic and Semantic Proximity on Machine Translation with Back-Translation

Abstract: Unsupervised on-the-fly back-translation, in conjunction with multilingual pretraining, is the dominant method for unsupervised neural machine translation. Theoretically, however, the method should not work in general. We therefore conduct controlled experiments with artificial languages to determine what properties of languages make back-translation an effective training method, covering lexical, syntactic, and semantic properties. We find, contrary to popular belief, that (i)~parallel word frequency distributions, (ii)~partially shared vocabulary, and (iii)~similar syntactic structure across languages are not sufficient to explain the success of back-translation. We show however that even crude semantic signal (similar lexical fields across languages) does improve alignment of two languages through back-translation. We conjecture that rich semantic dependencies, parallel across languages, are at the root of the success of unsupervised methods based on back-translation. Overall, the success of unsupervised machine translation was far from being analytically guaranteed. Instead, it is another proof that languages of the world share deep similarities, and we hope to show how to identify which of these similarities can serve the development of unsupervised, cross-linguistic tools.

URL: https://openreview.net/forum?id=6DflIABPQP

---

Title: Upper Bound of Bayesian Generalization Error in Partial Concept Bottleneck Model

Abstract: The Concept Bottleneck Model (CBM) is a method for explaining neural networks.
In CBM, concepts that correspond to the reasons for the outputs are inserted into the last intermediate layer as observed values.
It is expected that the relationship between the outputs and concepts can then be interpreted in a manner similar to linear regression.
However, this interpretation requires observing all concepts and increases the generalization error of neural networks.
Partial CBM (PCBM), which uses partially observed concepts, has been devised to resolve these difficulties.
Although some numerical experiments suggest that the generalization error of PCBMs is almost as low as that of the original neural networks,
the theoretical behavior of its generalization error has not yet been clarified because PCBM is a singular statistical model.
In this paper, we reveal the Bayesian generalization error in PCBM with a three-layered and linear architecture.
The result indicates that the structure of partially observed concepts decreases the Bayesian generalization error compared with that of CBM (fully observed concepts).

URL: https://openreview.net/forum?id=HBY2iewjVr

---

Title: DIG-MILP: a Deep Instance Generator for Mixed-Integer Linear Programming with Feasibility Guarantee

Abstract: Mixed-integer linear programming (MILP) stands as a notable NP-hard problem pivotal to numerous crucial industrial applications. The development of effective algorithms, the tuning of solvers, and the training of machine learning models for MILP resolution all hinge on access to extensive, diverse, and representative data. Yet compared to the abundant naturally occurring data in image and text realms, MILP is markedly data deficient, underscoring the vital role of synthetic MILP generation. We present DIG-MILP, a deep generative framework adept at extracting deep-level structural features from highly limited MILP data and producing instances that closely mirror the target data. Notably, by leveraging the MILP duality, DIG-MILP guarantees a correct and complete generation space as well as ensures the boundedness and feasibility of the generated instances. Our empirical study highlights the novelty and quality of the instances generated by DIG-MILP through two distinct downstream tasks: (S1) Data sharing, where solver solution times on the original and DIG-MILP-generated instances are highly positively correlated, allowing data sharing for solver tuning without publishing the original data; (S2) Data Augmentation, wherein the DIG-MILP-generated instances bolster the generalization performance of machine learning models tasked with resolving MILP problems.

URL: https://openreview.net/forum?id=MywlrEaFqR

---

Title: Supervised Domain Adaptation Based on Marginal and Conditional Distributions Alignment

Abstract: Supervised domain adaptation (SDA) is an area of machine learning, where the goal is to achieve good generalization performance on data from a target domain, given a small corpus of labeled training data from the target domain and a large corpus of labeled data from a related source domain.
In this work, based on a generalization of a well-known theoretical result of \citet{ben2010theory}, we propose an SDA approach, in which the adaptation is performed by aligning the marginal and conditional components of the input-label joint distributions.
In addition to being theoretically grounded, we demonstrate that the proposed approach has two advantages over existing SDA approaches. First, it applies to a broad collection of learning tasks, such as regression, classification, multi-label classification, and few-shot learning. Second, it takes into account the geometric structure of the input and label spaces. Experimentally, despite its generality, our approach demonstrates on-par or superior results compared with recent state-of-the-art task-specific methods.

URL: https://openreview.net/forum?id=ffBj12yh58

---

Title: On the Importance of Uncertainty in Decision-Making with Large Language Models

Abstract: We investigate the role of uncertainty in decision-making problems with natural language as input. For such tasks, using Large Language Models as agents has become the norm. However, none of the recent approaches employ any additional phase for estimating the uncertainty the agent has about the world during the decision-making task. We focus on a fundamental decision-making framework with natural language as input, namely contextual bandits, where the context information consists of text. As a representative of the approaches with no uncertainty estimation, we consider an LLM bandit with a greedy policy, which picks the action corresponding to the largest predicted reward. We compare this baseline to LLM bandits that make active use of uncertainty estimation by integrating the uncertainty in a Thompson Sampling policy. We employ different techniques for uncertainty estimation, such as Laplace Approximation, Dropout, and Epinets. We empirically show on real-world data that the greedy policy performs worse than the Thompson Sampling policies. These findings suggest that, while overlooked in the LLM literature, uncertainty plays a fundamental role in bandit tasks with LLMs.
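A minimal sketch of the comparison the abstract describes, with a Bayesian linear reward head on top of frozen text embeddings standing in for the Laplace/Dropout/Epinet heads used in the paper; all names, dimensions, and priors below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

class BayesLinearArm:
    """Bayesian linear reward model for one action, over fixed text embeddings."""
    def __init__(self, dim, prior_var=1.0, noise_var=0.25):
        self.P = np.eye(dim) / prior_var      # posterior precision
        self.b = np.zeros(dim)
        self.noise_var = noise_var
    def update(self, x, r):
        self.P += np.outer(x, x) / self.noise_var
        self.b += r * x / self.noise_var
    def mean(self, x):
        return x @ np.linalg.solve(self.P, self.b)
    def sample(self, x):
        mu = np.linalg.solve(self.P, self.b)
        w = rng.multivariate_normal(mu, np.linalg.inv(self.P))
        return x @ w

def choose(arms, x, thompson=True):
    # Greedy ignores uncertainty; Thompson Sampling explores via posterior draws.
    scores = [a.sample(x) if thompson else a.mean(x) for a in arms]
    return int(np.argmax(scores))

# Toy usage: 3 actions, 16-dim "LLM" context embeddings.
arms = [BayesLinearArm(16) for _ in range(3)]
x = rng.normal(size=16)
a = choose(arms, x, thompson=True)
arms[a].update(x, r=1.0)
```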

URL: https://openreview.net/forum?id=YfPzUX6DdO

---

Title: A Dual-Perspective Approach to Evaluating Feature Attribution Methods

Abstract: Feature attribution methods attempt to explain neural network predictions by identifying relevant features. However, establishing a cohesive framework for assessing feature attribution remains a challenge. There are several views through which we can evaluate attributions. One principal lens is to observe the effect of perturbing attributed features on the model’s behavior (i.e., faithfulness).
While providing useful insights, existing faithfulness evaluations suffer from shortcomings that we reveal in this paper. To address the limitations of previous evaluations, in this work, we propose two new perspectives within the faithfulness paradigm that reveal intuitive properties: soundness and completeness. Soundness assesses the degree to which attributed features are truly predictive features, while completeness examines how well the resulting attribution reveals all the predictive features. The two perspectives are based on a firm mathematical foundation and provide quantitative metrics that are computable through efficient algorithms. We apply these metrics to mainstream attribution methods, offering a novel lens through which to analyze and compare feature attribution methods.

URL: https://openreview.net/forum?id=znlTP5RLur

---

Title: Decomposition of Equivariant Maps via Invariant Maps: Application to Universal Approximation under Symmetry.

Abstract: In this paper, we develop a theory about the relationship between invariant and equivariant maps with regard to a group $G$. We then leverage this theory in the context of deep neural networks with group symmetries in order to obtain novel insight into their mechanisms. More precisely, we establish a one-to-one relationship between equivariant maps and certain invariant maps. This allows us to reduce arguments for equivariant maps to those for invariant maps and vice versa. As an application, we propose a construction of universal equivariant architectures built from universal invariant networks. We, in turn, explain how the universal architectures arising from our construction differ from standard equivariant architectures known to be universal. Furthermore, we explore the complexity, in terms of the number of free parameters, of our models, and discuss the relation between invariant and equivariant networks' complexity. Finally, we also give an approximation rate for $G$-equivariant deep neural networks with ReLU activation functions for finite group $G$.

URL: https://openreview.net/forum?id=ycOLyHh1Ue

---

Title: Cost-Sensitive Learning to Defer to Multiple Experts with Workload Constraints

Abstract: Learning to defer (L2D) aims to improve human-AI collaboration systems by learning how to defer decisions to humans when they are more likely to be correct than an ML classifier. Existing research in L2D overlooks key aspects of real-world systems that impede its practical adoption, namely: i) neglecting cost-sensitive scenarios, where type 1 and type 2 errors have different costs; ii) requiring concurrent human predictions for every instance of the training dataset and iii) not dealing with human work capacity constraints. To address these issues, we propose the deferral under cost and capacity constraints framework (DeCCaF). DeCCaF is a novel L2D approach, employing supervised learning to model the probability of human error under less restrictive data requirements (only one expert prediction per instance) and using constraint programming to globally minimize the error cost subject to workload limitations. We test DeCCaF in a series of cost-sensitive fraud detection scenarios with different teams of 9 synthetic fraud analysts, with individual work capacity constraints. The results demonstrate that our approach performs significantly better than the baselines in a wide array of scenarios, achieving an average $8.4\%$ reduction in the misclassification cost. The code used for the experiments is available at https://anonymous.4open.science/r/deccaf-1245/
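To illustrate the assignment step, here is a minimal sketch that minimizes expected misclassification cost under per-expert workload caps via a linear-programming relaxation in SciPy; the paper uses constraint programming, and in practice the cost matrix would come from its learned human-error model (the numbers below are toy values).

```python
import numpy as np
from scipy.optimize import linprog

def assign_with_capacities(cost, capacity):
    """Minimum-cost assignment of instances to experts under workload caps.
    cost: (n_instances, n_experts) expected misclassification costs.
    capacity: (n_experts,) maximum cases per expert (must sum to >= n).
    The LP relaxation of this transportation problem has integral optima."""
    n, m = cost.shape
    # Equality constraints: each instance is assigned to exactly one expert.
    A_eq = np.zeros((n, n * m)); b_eq = np.ones(n)
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0
    # Inequality constraints: expert j handles at most capacity[j] instances.
    A_ub = np.zeros((m, n * m)); b_ub = np.asarray(capacity, dtype=float)
    for j in range(m):
        A_ub[j, j::m] = 1.0
    res = linprog(cost.ravel(), A_ub=A_ub, b_ub=b_ub,
                  A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
    return res.x.reshape(n, m).argmax(axis=1)   # expert index per instance

# Toy usage: 6 instances, 2 experts, each limited to 3 cases.
rng = np.random.default_rng(0)
print(assign_with_capacities(rng.random((6, 2)), capacity=[3, 3]))
```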

URL: https://openreview.net/forum?id=TAvGZm2Rqb

---

Title: Reproducibility and Geometric Intrinsic Dimensionality: An Investigation on Graph Neural Network Research.

Abstract: Difficulties in the replication and reproducibility of empirical evidence in machine learning research have become a prominent topic in recent years. Ensuring that machine learning research results are sound and reliable requires reproducibility, which verifies the reliability of research findings using the same code and data. This promotes open and accessible research, robust experimental workflows, and the rapid integration of new findings. Evaluating the degree to which research publications support these different aspects of reproducibility is one goal of the present work. To this end, we introduce an ontology of reproducibility in machine learning and apply it to methods for graph neural networks. Building on these efforts, we turn towards another critical challenge in machine learning, namely the curse of dimensionality, which poses challenges in data collection, representation, and analysis, making it harder to find representative data and impeding the training and inference processes. Using the closely linked concept of geometric intrinsic dimension, we investigate to what extent the machine learning models used are influenced by the intrinsic dimension of the data sets they are trained on.

URL: https://openreview.net/forum?id=CtEGxIqtud

---

Title: LEA: Learning Latent Embedding Alignment Model for fMRI Decoding and Encoding

Abstract: The connection between brain activity and visual stimuli is crucial to understanding the human brain. While deep generative models have exhibited advances in recovering brain recordings by generating images conditioned on fMRI signals, it is still a challenge to generate consistent semantics. Moreover, the prediction of fMRI signals from visual stimuli remains a hard problem. In this paper, we introduce a unified framework that addresses both fMRI decoding and encoding. By training two latent spaces to represent and reconstruct fMRI signals and visual images, respectively, we align the fMRI signals and visual images within the latent spaces, allowing us to transform between the two seamlessly. Our model, called Latent Embedding Alignment (LEA), concurrently recovers visual stimuli from fMRI signals and predicts brain activity from images. LEA outperforms existing methods on multiple benchmark fMRI decoding and encoding datasets. LEA offers a comprehensive solution for modeling the relationship between fMRI signals and visual stimuli.

URL: https://openreview.net/forum?id=89QT2DsKyj

---

Title: Solving Robust MDPs through No-Regret Dynamics

Abstract: Reinforcement learning is a powerful framework for training agents to navigate different situations, but it is susceptible to changes in environmental dynamics. Generating an algorithm that can find environmentally robust policies efficiently and handle different model parameterizations without imposing stringent assumptions on the uncertainty set of transitions is difficult due to the intricate interactions between policy and environment. In this paper, we address both of these issues with a No-Regret Dynamics framework that utilizes policy gradient methods and iteratively approximates the worst case environment during training, avoiding assumptions on the uncertainty set. Alongside a toolbox of nonconvex online learning algorithms, we demonstrate that our framework can achieve fast convergence rates for many different problem settings and relax assumptions on the uncertainty set of transitions.

URL: https://openreview.net/forum?id=SdCuffxg5A

---

Title: What Does Softmax Probability Tell Us about Classifiers Ranking Across Diverse Test Conditions?

Abstract: This work aims to develop a measure that can accurately rank the performance of various classifiers when they are tested on unlabeled data from out-of-distribution (OOD) distributions. We commence by demonstrating that conventional uncertainty metrics, notably the maximum Softmax prediction probability, possess inherent utility in forecasting model generalization across certain OOD contexts. Building on this insight, we introduce a new measure called Softmax Correlation (SoftmaxCorr). It calculates the cosine similarity between a class-class correlation matrix, constructed from Softmax output vectors across an unlabeled test dataset, and a predefined reference matrix that embodies ideal class correlations. A high resemblance of predictions to the reference matrix signals that the model delivers confident and uniform predictions across all categories, reflecting minimal uncertainty and confusion. Through rigorous evaluation across a suite of datasets, including ImageNet, CIFAR-10, and WILDS, we affirm the predictive validity of SoftmaxCorr in accurately forecasting model performance within both in-distribution (ID) and OOD settings. Furthermore, we discuss the limitations of our proposed measure and suggest avenues for future research.
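A minimal sketch of a SoftmaxCorr-style score following the description in the abstract: build a class-class correlation matrix from softmax outputs on unlabeled data and take its cosine similarity with a reference matrix. The identity-like reference and toy data below are assumptions; the paper's exact construction may differ.

```python
import numpy as np

def softmax_corr(probs, ref=None):
    """Cosine similarity between the class-class matrix P^T P built from
    softmax outputs (n_samples, n_classes) and a reference matrix encoding
    ideal, confident, balanced predictions."""
    k = probs.shape[1]
    C = probs.T @ probs / len(probs)          # class-class correlation matrix
    ref = np.eye(k) / k if ref is None else ref
    return float((C * ref).sum() / (np.linalg.norm(C) * np.linalg.norm(ref)))

# Confident, well-spread predictions score higher than uniform ones:
rng = np.random.default_rng(0)
confident = np.eye(5)[rng.integers(0, 5, 1000)] * 0.96 + 0.01
confident /= confident.sum(axis=1, keepdims=True)
uniform = np.full((1000, 5), 0.2)
print(softmax_corr(confident), softmax_corr(uniform))
```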

URL: https://openreview.net/forum?id=vtiDUgGjyx

---

Title: A Greedy Hierarchical Approach to Whole-Network Filter-Pruning in CNNs

Abstract: Deep convolutional neural networks (CNNs) have achieved impressive performance in many computer vision tasks. However, their large model sizes require heavy computational resources, making pruning redundant filters from existing pre-trained CNNs an essential task in developing efficient models for resource-constrained devices. Whole-network filter pruning algorithms prune varying fractions of filters from each layer, hence providing greater flexibility. State-of-the-art whole-network pruning methods are either computationally expensive due to the need to calculate the loss for each pruned filter using a training dataset, or use various heuristic or learned criteria for determining the pruning fractions for each layer. Hence there is a need for a simple and efficient technique for whole-network pruning. This paper proposes a two-level hierarchical approach for whole-network filter pruning which is efficient and uses the classification loss as the final criterion. The lower-level algorithm (called filter-pruning) uses a sparse-approximation formulation based on linear approximation of filter weights. We explore two algorithms: orthogonal matching pursuit-based greedy selection and a greedy backward pruning approach. The backward pruning algorithm uses a novel closed-form error criterion for efficiently selecting the optimal filter at each stage, thus making the whole algorithm much faster. The higher-level algorithm (called layer-selection) greedily selects the best-pruned layer (pruned using the lower-level filter-pruning algorithm) using a global pruning criterion. We propose algorithms for two different global-pruning criteria: (1) layerwise relative error (HBGS), and (2) final classification error (HBGTS). Our suite of algorithms outperforms state-of-the-art pruning methods on ResNet18, ResNet32, ResNet56, VGG16, and ResNext101. Our method reduces the RAM requirement for ResNext101 from 7.6 GB to 1.5 GB and achieves a 94% reduction in FLOPS without losing accuracy on CIFAR-10.
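A toy sketch of the backward greedy idea based on linear approximation of filter weights: repeatedly drop the filter whose removal leaves the remaining filters best able to reconstruct the original ones. This naive version solves a least-squares problem per candidate; the paper's closed-form criterion makes each step much cheaper.

```python
import numpy as np

def backward_filter_prune(W, keep):
    """Greedy backward pruning sketch. W is (n_filters, d) with each filter's
    weights flattened into a row; returns the indices of the kept filters."""
    idx = list(range(W.shape[0]))
    while len(idx) > keep:
        errs = []
        for j in idx:
            rest = [i for i in idx if i != j]
            B = W[rest]                                    # candidate basis
            coef, *_ = np.linalg.lstsq(B.T, W.T, rcond=None)
            errs.append(np.linalg.norm(W - coef.T @ B))    # reconstruction error
        idx.remove(idx[int(np.argmin(errs))])
    return idx

# Toy usage: prune a layer of 16 filters (3x3x3 each) down to 8.
rng = np.random.default_rng(0)
print(backward_filter_prune(rng.normal(size=(16, 27)), keep=8))
```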

URL: https://openreview.net/forum?id=WzHuebRSgQ

---

Title: Training-free linear image inverses via flows

Abstract: Solving inverse problems without any training involves using a pretrained generative model and making appropriate modifications to the generation process to avoid finetuning of the generative model. While recent methods have explored the use of diffusion models, they still require the manual tuning of many hyperparameters for different inverse problems. In this work, we propose a training-free method for solving linear inverse problems by using pretrained flow models, leveraging the simplicity and efficiency of Flow Matching models, using theoretically-justified weighting schemes, and thereby significantly reducing the amount of manual tuning. In particular, we draw inspiration from two main sources: adopting prior gradient correction methods to the flow regime, and a solver scheme based on conditional Optimal Transport paths. As pretrained diffusion models are widely accessible, we also show how to practically adapt diffusion models for our method. Empirically, our approach requires no problem-specific tuning across an extensive suite of noisy linear inverse problems on high-dimensional datasets, ImageNet-64/128 and AFHQ-256, and we observe that our flow-based method for solving inverse problems improves upon closely-related diffusion-based methods in most settings.

URL: https://openreview.net/forum?id=PLIt3a4yTm

---

Title: Robust Semi-Supervised Metric Learning Meets with High Dimensionality

Abstract: Classical semi-supervised metric learning usually formulates the objectives via maximizing/minimizing the ratio formed with must-links and cannot-links. However, the presence of noise and adversarial attacks can result in incorrect pairings, which diminish the reliability of the learned projection directions. To develop a robust distance metric learning method, we propose a new objective for distance metric learning based on $\ell_{2,q}$-norm ($0<q<2$) distances, which alleviates the influence of outliers or adversarial attacks. We develop an algorithm that decreases the objective monotonically with updates. Additionally, we address computational burdens (e.g., $\mathcal{O}(d^3)$ complexity, where $d$ is the feature dimension) by introducing a 2D metric learning algorithm and extending it to arbitrary dimensions with kernel methods, backed by theoretical guarantees. Extensive empirical evaluations consistently demonstrate the superiority of our methods across various experimental setups.

URL: https://openreview.net/forum?id=wB12L5h4Og

---

Title: Controlling the Fidelity and Diversity of Deep Generative Models via Pseudo Density

Abstract: We introduce an approach to bias deep generative models, such as GANs and diffusion models, towards generating data with either enhanced fidelity or increased diversity. Our approach involves manipulating the distribution of training and generated data through a novel metric for individual samples, named pseudo density, which is based on the nearest-neighbor information from real samples. Our approach offers three distinct techniques to adjust the fidelity and diversity of deep generative models: 1) Per-sample perturbation, enabling precise adjustments for individual samples towards either more common or more unique characteristics; 2) Importance sampling during model inference to enhance either fidelity or diversity in the generated data; 3) Fine-tuning with importance sampling, which guides the generative model to learn an adjusted distribution, thus controlling fidelity and diversity. Furthermore, our fine-tuning method demonstrates the ability to improve the Frechet Inception Distance (FID) for pre-trained generative models with minimal iterations.
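A minimal sketch of a nearest-neighbor pseudo-density score and the importance-sampling use described in the abstract; the feature space (e.g., Inception embeddings), the number of neighbours, and the normalisation are illustrative assumptions rather than the paper's exact choices.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def pseudo_density(real_feats, query_feats, k=5):
    """Nearest-neighbor density proxy: samples whose k-th neighbour among real
    features is close get a high score (common/high-fidelity); isolated
    samples get a low score (rare/diverse)."""
    nn = NearestNeighbors(n_neighbors=k).fit(real_feats)
    dist, _ = nn.kneighbors(query_feats)
    return 1.0 / (dist[:, -1] + 1e-8)   # inverse distance to the k-th neighbour

def resample_for_fidelity(gen_feats, real_feats, tau=1.0, rng=None):
    """Importance sampling at inference time: keep generated samples with
    probability proportional to pseudo density; tau controls the strength."""
    rng = rng or np.random.default_rng(0)
    w = pseudo_density(real_feats, gen_feats, k=5) ** tau
    idx = rng.choice(len(gen_feats), size=len(gen_feats), p=w / w.sum())
    return np.asarray(gen_feats)[idx]
```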

URL: https://openreview.net/forum?id=8Vk1Bmg3sY

---

Title: Towards Unbiased Calibration using Meta-Regularization

Abstract: Model miscalibration has been frequently identified in modern deep neural networks. Recent work aims to improve model calibration directly through a differentiable calibration proxy. However, the calibration produced is often biased due to the binning mechanism. In this work, we propose to learn better-calibrated models via meta-regularization, which has two components: (1) gamma network (gamma-net), a meta learner that outputs a sample-wise gamma value (continuous variable) for the focal loss used to regularize the backbone network; (2) smooth expected calibration error (SECE), a Gaussian-kernel-based, unbiased, and differentiable surrogate to ECE that enables the smooth optimization of gamma-net. We evaluate the effectiveness of the proposed approach in regularizing neural networks towards better and unbiased calibration on three computer vision datasets. We empirically demonstrate that: (a) learning sample-wise $\gamma$ as continuous variables can effectively improve calibration; (b) SECE smoothly optimizes gamma-net towards unbiasedness and robustness to binning schemes; and (c) the combination of gamma-net and SECE achieves the best calibration performance across various calibration metrics and retains very competitive predictive performance as compared to multiple recently proposed methods.
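To illustrate the kernel-smoothing idea behind SECE, here is a sketch of a Gaussian-kernel calibration error that replaces hard binning with soft weights around grid points; it conveys the idea but is not the paper's exact estimator, and the bandwidth is an assumption. The gamma-net component is not shown.

```python
import numpy as np

def smooth_ece(conf, correct, bandwidth=0.05, grid=50):
    """Kernel-smoothed calibration error over confidences in [0, 1].
    conf: (n,) predicted confidences; correct: (n,) 0/1 correctness."""
    t = np.linspace(0.0, 1.0, grid)
    w = np.exp(-(conf[None, :] - t[:, None]) ** 2 / (2 * bandwidth ** 2))
    w_sum = w.sum(axis=1) + 1e-12
    acc_t = (w * correct[None, :]).sum(axis=1) / w_sum    # smoothed accuracy
    conf_t = (w * conf[None, :]).sum(axis=1) / w_sum      # smoothed confidence
    density = w_sum / w_sum.sum()
    return float(np.sum(density * np.abs(acc_t - conf_t)))

# A well-calibrated toy model should score close to zero:
rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, 2000)
correct = (rng.random(2000) < conf).astype(float)
print(smooth_ece(conf, correct))
```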

URL: https://openreview.net/forum?id=Yf8iHCfG4W

---

Title: Simple and Scalable Strategies to Continually Pre-train Large Language Models

Abstract: Large language models (LLMs) are routinely pre-trained on billions of tokens, only to start the process over again once new data becomes available. A much more efficient solution is to continually pre-train these models—saving significant compute compared to re-training. However, the distribution shift induced by new data typically results in degraded performance on previous data or poor adaptation to the new data. In this work, we show that a simple and scalable combination of learning rate (LR) re-warming, LR re-decaying, and replay of previous data is sufficient to match the performance of fully re-training from scratch on all available data, as measured by the final loss and the average score on several language model (LM) evaluation benchmarks. Specifically, we show this for a weak but realistic distribution shift between two commonly used LLM pre-training datasets (English$\rightarrow$English) and a stronger distribution shift (English$\rightarrow$German) at the $405$M parameter model scale with large dataset sizes (hundreds of billions of tokens). Selecting the weak but realistic shift for larger-scale experiments, we also find that our continual learning strategies match the re-training baseline for a 10B parameter LLM. Our results demonstrate that autoregressive transformer-based LLMs can be successfully updated via simple and scalable continual learning strategies, matching the re-training baseline using only a fraction of the compute. Finally, inspired by previous work, we propose alternatives to the cosine learning rate schedule that help circumvent forgetting induced by LR re-warming and that are not bound to a fixed token budget.
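A minimal sketch of the two ingredients named in the abstract, learning-rate re-warming/re-decaying and replay; the schedule shape follows standard warmup-plus-cosine practice, and the specific constants and replay fraction are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def rewarmed_cosine_lr(step, warmup_steps, total_steps, max_lr, min_lr):
    """Re-warm linearly from min_lr to max_lr on the new dataset, then
    re-decay with a cosine schedule back to min_lr."""
    if step < warmup_steps:
        return min_lr + (max_lr - min_lr) * step / warmup_steps
    frac = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + np.cos(np.pi * frac))

def replay_mix(new_ids, n_old, replay_frac=0.05, rng=None):
    """Compose a batch that is mostly new-corpus examples plus a small
    replayed fraction of the previous pre-training corpus; returns indices
    into the new and old datasets separately."""
    rng = rng or np.random.default_rng(0)
    n_replay = int(round(replay_frac * len(new_ids)))
    old_ids = rng.choice(n_old, size=n_replay, replace=False)
    return new_ids[: len(new_ids) - n_replay], old_ids

# Schedule values at a few steps of a 1000-step continual-pre-training run:
for step in (0, 50, 100, 500, 1000):
    print(step, round(rewarmed_cosine_lr(step, 100, 1000, 3e-4, 3e-5), 6))
```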

URL: https://openreview.net/forum?id=DimPeeCxKO

---

Title: MDGCL: Debiased Graph Contrastive Learning with Knowledge of Model Discrepancy

Abstract: Graph contrastive learning (GCL) has shown promising results for self-supervised representation learning on graph-structured data, benefiting various downstream tasks such as node classification and graph classification. Despite their outstanding performance, a prevalent issue in most existing GCL methods is the arbitrary selection of other data points as negative samples, even when they share the same ground truth label with the anchor. The inclusion of such false negative samples could degrade the performance of GCL. In this study, we present a dual-branch ensembling learning framework, which provides model discrepancy as a crucial indicator to more effectively differentiate false negatives from true negatives. Building on this, we develop a debiased contrastive learning objective. This objective focuses on pulling false negatives closer to the anchor in the embedding space, while simultaneously retaining the capacity to repel true negatives away from the anchor. Extensive experiments on real-world datasets demonstrate the effectiveness of our framework.

URL: https://openreview.net/forum?id=V403aPMHM3

---

Title: Winning the Lottery Once and For All: Towards Pruning Neural Networks at Initialization

Abstract: The Lottery Ticket Hypothesis (LTH) posits the existence of winning tickets, i.e., sparse subnetworks within randomly initialized dense neural networks that are capable of achieving test accuracy comparable to the original, unpruned counterpart when trained from scratch, with an optimal learning rate and in a similar training budget. Despite this promising conjecture, recent studies have cast doubt on the feasibility of identifying such winning tickets at initialization, particularly in large-scale settings. They suggest that in such expansive environments, winning tickets emerge only during the early phase of training. This observation contradicts the core tenet of LTH, as these winning tickets do not truly win the initialization lottery. In light of recent findings, we address a critical question: If winning tickets can only be obtained during early iterations, does the initial training phase of a neural network encode vital knowledge, which we refer to as lottery-ticket information, that can be utilized to generate winning tickets at initialization, especially in large-scale scenarios?

We affirmatively answer this question by introducing a novel premise, Knowledge Distillation-based Lottery Ticket Search. Our framework harnesses latent response, feature, and relation-based lottery-ticket information from an ensemble of teacher networks, employing a series of deterministic approximations to address an intractable Mixed Integer Optimization problem. This enables us to consistently win the initialization lottery in complex settings, identifying winning tickets right from the initialization point at sparsity levels as high as 95% for VGG-16 and 65% for ResNet-20, and accomplishing this 19 times faster than Iterative Magnitude Pruning (IMP). Remarkably, without bells and whistles, even winning tickets identified early in the training process using our technique consistently yield a performance gain of 2% for VGG-16 and 1.5% for ResNet-20 across various levels of sparsity, thereby surpassing existing methods.

URL: https://openreview.net/forum?id=YIOL0wdHsi

---

Title: CMMD: Contrastive Multi-Modal Diffusion for Video-Audio Conditional Modeling

Abstract: We introduce a multi-modal diffusion model tailored for the bi-directional conditional generation of video and audio. Recognizing the importance of accurate alignment between video and audio events in multi-modal generation tasks, we propose a joint contrastive training loss to enhance the synchronization between visual and auditory occurrences.
Our research methodology involves conducting comprehensive experiments on multiple datasets to thoroughly evaluate the efficacy of our proposed model. The assessment of generation quality and alignment performance is carried out from various angles, encompassing both objective and subjective metrics.
Our findings demonstrate that the proposed model outperforms the baseline, substantiating its effectiveness and efficiency. Notably, the incorporation of the contrastive loss results in improvements in audio-visual alignment, particularly in the high-correlation video-to-audio generation task. These results indicate the potential of our proposed model as a robust solution for improving the quality and alignment of multi-modal generation, thereby contributing to the advancement of video and audio conditional generation systems.

URL: https://openreview.net/forum?id=ZW6ZATI4iI

---

Title: Regret Bounds for Noise-Free Cascaded Kernelized Bandits

Abstract: We consider optimizing a function network in the noise-free grey-box setting with RKHS function classes, where the exact intermediate results are observable. We assume that the structure of the network is known (but not the underlying functions comprising it), and we study three types of structures: (1) chain: a cascade of scalar-valued functions, (2) multi-output chain: a cascade of vector-valued functions, and (3) feed-forward network: a fully connected feed-forward network of scalar-valued functions. We propose a sequential upper confidence bound based algorithm GPN-UCB along with a general theoretical upper bound on the cumulative regret. In addition, we propose a non-adaptive sampling based method along with its theoretical upper bound on the simple regret for the Mat\'ern kernel. We also provide algorithm-independent lower bounds on the simple regret and cumulative regret. Our regret bounds for GPN-UCB have the same dependence on the time horizon as the best known in the vanilla black-box setting, as well as near-optimal dependencies on other parameters (e.g., RKHS norm and network length).

URL: https://openreview.net/forum?id=oCfamUtecN

---

Title: A Survey on Data Selection for Language Models

Abstract: A major factor in the recent success of large language models is the use of enormous and ever-growing text datasets for unsupervised pre-training. However, naively training a model on all available data may not be optimal (or feasible), as the quality of available text data can vary. Filtering out data can also decrease the carbon footprint and financial costs of training models by reducing the amount of training required.

Data selection methods aim to determine which candidate data points to include in the training dataset and how to appropriately sample from the selected data points. The promise of improved data selection methods has caused the volume of research in the area to rapidly expand. However, because deep learning is mostly driven by empirical evidence and experimentation on large-scale data is expensive, few organizations have the resources for extensive data selection research. Consequently, knowledge of effective data selection practices has become concentrated within a few organizations, many of which do not openly share their findings and methodologies.

To narrow this gap in knowledge, we present a comprehensive review of existing literature on data selection methods and related research areas, providing a taxonomy of existing approaches. By describing the current landscape of research, this work aims to accelerate progress in data selection by establishing an entry point for new and established researchers. Additionally, throughout this review we draw attention to noticeable holes in the literature and conclude the paper by proposing promising avenues for future research.

URL: https://openreview.net/forum?id=XfHWcNTSHp

---

Title: Text Descriptions are Compressive and Invariant Representations for Visual Learning

Abstract: Modern image classification is based on directly predicting classes via large discriminative networks, which do not directly contain information about the intuitive visual features that may constitute a classification decision. Recently, work in vision-language models (VLM) such as CLIP has provided ways to specify natural language descriptions of image classes, but typically focuses on providing single descriptions for each class. In this work, we demonstrate that an alternative approach, in line with humans' understanding of multiple visual features per class, can also provide compelling performance in the robust few-shot learning setting. In particular, we introduce a novel method, \textit{SLR-AVD (Sparse Logistic Regression using Augmented Visual Descriptors)}. This method first automatically generates multiple visual descriptions of each class via a large language model (LLM), then uses a VLM to translate these descriptions to a set of visual feature embeddings of each image, and finally uses sparse logistic regression to select a relevant subset of these features to classify each image. Core to our approach is the fact that, information-theoretically, these descriptive features are more invariant to domain shift than traditional image embeddings, even though the VLM training process is not explicitly designed for invariant representation learning. These invariant descriptive features also compose a better input compression scheme. When combined with finetuning, we show that SLR-AVD is able to outperform existing state-of-the-art finetuning approaches in both in-distribution and out-of-distribution tasks.
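A minimal sketch of the SLR-AVD pipeline described in the abstract: score each image against a bank of LLM-generated visual descriptions via embedding similarity, then fit an L1-regularised (sparse) logistic regression on those description scores. The arrays stand in for CLIP-style image and text embeddings; the actual prompts, encoders, and regularisation strength are the paper's and are assumed here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_descriptor_classifier(img_embs, desc_embs, labels, C=0.1):
    """img_embs: (n_images, d) image embeddings; desc_embs: (n_descriptions, d)
    text embeddings of LLM-generated visual descriptions; labels: (n_images,).
    Returns the fitted sparse classifier and the indices of selected descriptions."""
    # Cosine similarity of every image to every textual description.
    img = img_embs / np.linalg.norm(img_embs, axis=1, keepdims=True)
    txt = desc_embs / np.linalg.norm(desc_embs, axis=1, keepdims=True)
    features = img @ txt.T                       # (n_images, n_descriptions)
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    clf.fit(features, labels)
    selected = np.flatnonzero(np.abs(clf.coef_).sum(axis=0) > 1e-6)
    return clf, selected                         # sparse set of descriptions

# Toy usage with random embeddings for 200 images, 30 descriptions, 3 classes:
rng = np.random.default_rng(0)
clf, kept = fit_descriptor_classifier(rng.normal(size=(200, 64)),
                                       rng.normal(size=(30, 64)),
                                       rng.integers(0, 3, 200))
print(kept)
```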

URL: https://openreview.net/forum?id=spo705Fyv0

---

Title: Exact Fractional Inference via Re-Parametrization \& Interpolation between Tree-Re-Weighted- and Belief Propagation- Algorithms

Abstract: Inference efforts -- required to compute partition function, $Z$, of an Ising model over a graph of $N$ ``spins" -- are most likely exponential in $N$. Efficient variational methods, such as Belief Propagation (BP) and Tree Re-Weighted (TRW) algorithms, compute $Z$ approximately minimizing respective (BP- or TRW-) free energy. We generalize the variational scheme building a $\lambda$- fractional- homotopy, $Z^{(\lambda)}$, where $\lambda=0$ and $\lambda=1$ correspond to TRW- and BP-approximations, respectively, and $Z^{(\lambda)}$ decreases with $\lambda$ monotonically. Moreover, this fractional scheme guarantees that in the attractive (ferromagnetic) case $Z^{(TRW)}\geq Z^{(\lambda)}\geq Z^{(BP)}$, and there exists a unique (``exact") $\lambda_*$ such that, $Z=Z^{(\lambda_*)}$. Generalizing the re-parametrization and the loop series approach, we show how to express $Z$ as a product, $\forall \lambda:\ Z=Z^{(\lambda)}{\cal Z}^{(\lambda)}$, where the multiplicative correction, ${\cal Z}^{(\lambda)}$, is an expectation over a node-independent probability distribution built from node-wise fractional marginals. Our theoretical analysis is complemented by extensive experiments with models from Ising ensembles over planar and random graphs of medium- and large- sizes. The empirical study yields a number of interesting observations, such as (a) ability to estimate ${\cal Z}^{(\lambda)}$ with $O(N^4)$ fractional samples; (b) suppression of $\lambda_*$ fluctuations with increase in $N$ for instances from a particular random Ising ensemble.

URL: https://openreview.net/forum?id=AWRpSgaNfc

---

Title: Uncovering Sets of Maximum Dissimilarity on Random Process Data

Abstract: The comparison of local characteristics of two random processes can shed light on periods of time or space at which the processes differ the most. This paper proposes a method that learns about regions of a certain volume where the marginal attributes of the two processes are least similar. The proposed method is devised in full generality for the setting where the data of interest are themselves stochastic processes, and thus it can be used to point out regions of maximum dissimilarity of a certain volume in the contexts of point processes, functional data, and time series. The parameter functions underlying both stochastic processes of interest are modeled via a basis representation, and Bayesian inference is conducted via an integrated nested Laplace approximation. The numerical studies validate the proposed method, and we showcase its application with case studies on criminology, finance, and medicine.

URL: https://openreview.net/forum?id=ntWCJrlDD8

---

Title: Misspecification-robust Sequential Neural Likelihood for Simulation-based Inference

Abstract: Simulation-based inference techniques are indispensable for parameter estimation of mechanistic and simulable models with intractable likelihoods. While traditional statistical approaches like approximate Bayesian computation and Bayesian synthetic likelihood have been studied under well-specified and misspecified settings, they often suffer from inefficiencies due to wasted model simulations. Neural approaches, such as sequential neural likelihood (SNL), avoid this wastage by utilising all model simulations to train a neural surrogate for the likelihood function. However, the performance of SNL under model misspecification is unreliable and can result in overconfident posteriors centred around an inaccurate parameter estimate. In this paper, we propose a novel SNL method which, through the incorporation of additional adjustment parameters, is robust to model misspecification and capable of identifying features of the data that the model is not able to recover. We demonstrate the efficacy of our approach through several illustrative examples, where our method gives more accurate point estimates and uncertainty quantification than SNL.

URL: https://openreview.net/forum?id=tbOYJwXhcY

---

Title: Neural Graph Reasoning: A Survey on Complex Logical Query Answering

Abstract: Complex logical query answering (CLQA) is a recently emerged task of graph machine learning that goes beyond simple one-hop link prediction and solves the far more complex task of multi-hop logical reasoning over massive, potentially incomplete graphs.
The task received significant traction in the community; numerous works expanded the field along theoretical and practical axes to tackle different types of complex queries and graph modalities with efficient systems.
In this paper, we provide a holistic survey of CLQA with a detailed taxonomy studying the field from multiple angles, including graph types (modality, reasoning domain, background semantics), modeling aspects (encoder, processor, decoder), supported queries (operators, patterns, projected variables), datasets, evaluation metrics, and applications.
Finally, we point out promising directions, unsolved problems and applications of CLQA for future research.

URL: https://openreview.net/forum?id=xG8un9ZbqT

---

Title: Spike Accumulation Forwarding for Effective Training of Spiking Neural Networks

Abstract: In this article, we propose a new paradigm for training spiking neural networks (SNNs), spike accumulation forwarding (SAF). It is known that SNNs are energy-efficient but difficult to train. Consequently, many researchers have proposed various methods to solve this problem, among which online training through time (OTTT) is a method that allows inference at each time step while suppressing the memory cost. However, to compute efficiently on GPUs, OTTT requires operations with spike trains and weighted summation of spike trains during forwarding. In addition, OTTT has shown a relationship with the Spike Representation, an alternative training method, though theoretical agreement with the Spike Representation has yet to be proven. Our proposed method can solve these problems; namely, SAF can halve the number of operations during the forward process, and it can be theoretically proven that SAF is consistent with the Spike Representation and OTTT, respectively. Furthermore, we confirmed these claims through experiments and showed that it is possible to reduce memory and training time while maintaining accuracy.

URL: https://openreview.net/forum?id=RGQsUQDAd9

---

Title: Community Detection: Exact Inference with Latent Variables in an Arbitrary Domain

Abstract: We analyze the necessary and sufficient conditions for exact inference of a latent model in the context of community detection. In latent models, each entity is associated with a latent variable following some probability distribution. The challenging question we try to solve is: can we perform exact inference without observing the latent variables, even without knowing what the domain of the latent variables is? We show that exact inference can be achieved using a semidefinite programming (SDP) approach without knowing either the latent variables or their domain. Our analysis predicts the experimental correctness of SDP with high accuracy, showing the suitability of our focus on the Karush-Kuhn-Tucker conditions and the spectrum of a properly defined matrix. Running on a laptop-equivalent machine, our method can achieve exact inference in models with over 10000 entities efficiently. As a byproduct of our analysis, we also provide concentration inequalities with dependence on latent variables, both for bounded moment generating functions as well as for the spectra of matrices. To the best of our knowledge, these results are novel and could be useful for many other problems.
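
To give a concrete flavour of what an SDP approach to community detection can look like, the sketch below solves a standard relaxation for two balanced communities (maximize $\langle A, X\rangle$ subject to $X \succeq 0$, unit diagonal, and a balance constraint) and rounds with the top eigenvector; this generic program is an illustration of ours, not necessarily the exact formulation analyzed in the paper.

    # Hedged sketch: a generic SDP relaxation for two balanced communities,
    # not necessarily the exact program analyzed in the paper.
    import cvxpy as cp
    import numpy as np

    rng = np.random.default_rng(0)
    n = 20
    truth = np.repeat([1.0, -1.0], n // 2)                    # hidden balanced communities
    p = np.where(np.outer(truth, truth) > 0, 0.9, 0.1)        # in/out edge probabilities
    A = np.triu((rng.random((n, n)) < p).astype(float), 1)
    A = A + A.T                                               # symmetric adjacency matrix

    X = cp.Variable((n, n), symmetric=True)
    constraints = [X >> 0, cp.diag(X) == 1, cp.sum(X) == 0]   # PSD, unit diag, balance
    cp.Problem(cp.Maximize(cp.trace(A @ X)), constraints).solve()

    est = np.sign(np.linalg.eigh(X.value)[1][:, -1])          # round with top eigenvector
    print("agreement:", max(np.mean(est == truth), np.mean(est == -truth)))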

URL: https://openreview.net/forum?id=8ZVW3gRooy

---

Title: On the Unreasonable Effectiveness of Federated Averaging with Heterogeneous Data

Abstract: Existing theoretical results (such as Woodworth et al., 2020a) predict that the performance of federated averaging (FedAvg) is degraded by high data heterogeneity. However, in practice, FedAvg converges well on several naturally heterogeneous datasets. In order to explain this seemingly unreasonable effectiveness of FedAvg that contradicts previous theoretical predictions, this paper introduces the client consensus hypothesis: on certain federated datasets, the average of local model updates on clients, starting from the optimum, is close to zero. We prove that under this hypothesis, data heterogeneity does not degrade the convergence of FedAvg. Moreover, we show that this hypothesis holds for a linear regression problem and some naturally heterogeneous datasets such as FEMNIST and StackOverflow. Therefore, we believe that this hypothesis can better explain the performance of FedAvg in practice.
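
A toy numerical check of the hypothesis, under assumptions of our own (synthetic linear-regression clients with squared loss and a few local gradient steps), might look as follows.

    # Hedged sketch (toy setup of ours): is the average of local updates taken
    # from the global optimum close to zero, despite heterogeneous clients?
    import numpy as np

    rng = np.random.default_rng(0)
    d, n_clients, lr, local_steps = 5, 10, 0.01, 5

    data = []
    for _ in range(n_clients):
        X = rng.normal(size=(50, d))
        w_c = rng.normal(size=d)                     # heterogeneous local ground truths
        data.append((X, X @ w_c + 0.1 * rng.normal(size=50)))

    # Global least-squares optimum over the pooled data.
    X_all = np.vstack([X for X, _ in data])
    y_all = np.concatenate([y for _, y in data])
    w_star = np.linalg.lstsq(X_all, y_all, rcond=None)[0]

    updates = []
    for X, y in data:
        w = w_star.copy()
        for _ in range(local_steps):                 # a few local gradient steps
            w -= lr * X.T @ (X @ w - y) / len(y)
        updates.append(w - w_star)

    print("norm of averaged local update:", np.linalg.norm(np.mean(updates, axis=0)))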

URL: https://openreview.net/forum?id=zF76Ga4EPs

---

Title: XAI-Based Detection of Adversarial Attacks on Deepfake Detectors

Abstract: We introduce a novel methodology for identifying adversarial attacks on deepfake detectors using eXplainable Artificial Intelligence (XAI). In an era characterized by digital advancement, deepfakes have emerged as a potent tool, creating a demand for efficient detection systems. However, these systems are frequently targeted by adversarial attacks that inhibit their performance. We address this gap, developing a defensible deepfake detector by leveraging the power of XAI. The proposed methodology uses XAI to generate interpretability maps for a given method, providing explicit visualizations of decision-making factors within the AI models. We subsequently employ a pretrained feature extractor that processes both the input image and its corresponding XAI image. The feature embeddings extracted from this process are then used for training a simple yet effective classifier. Our approach contributes not only to the detection of deepfakes but also enhances the understanding of possible adversarial attacks, pinpointing potential vulnerabilities. Furthermore, this approach does not change the performance of the deepfake detector. The paper demonstrates promising results suggesting a potential pathway for future deepfake detection mechanisms. We believe this study will serve as a valuable contribution to the community, sparking much-needed discourse on safeguarding deepfake detectors.

URL: https://openreview.net/forum?id=7pBKrcn199

---

Title: The Garden of Forking paths: Observing Dynamic Parameters Distribution in Large Language Models

Abstract: A substantial gap persists in understanding the reasons behind the exceptional performance of the Transformer architecture in NLP.
A particularly unexplored area involves the mechanistic description of how the distribution of parameters evolves over time during training.
In this work, we suggest that looking at the time evolution of the statistical distribution of model parameters, and specifically at bifurcation effects, can help in understanding model quality, potentially reducing training costs and evaluation efforts and empirically explaining the effectiveness of weight sparsification.

URL: https://openreview.net/forum?id=0yfQkuvtH1

---

Title: Sparse Contextual CDF Regression

Abstract: Estimating cumulative distribution functions (CDFs) of context dependent random variables is a central statistical task underpinning numerous applications in machine learning and economics. In this work, we extend a recent line of theoretical inquiry into this domain by analyzing the problem of \emph{sparse contextual CDF regression}, wherein data points are sampled from a convex combination of $s$ context dependent CDFs chosen from a set of $d$ basis functions. We show that adaptations of several canonical regression methods serve as tractable estimators in this functional sparse regression setting under standard assumptions on the conditioning of the basis functions. In particular, given $n$ data samples, we prove estimation error upper bounds of $\tilde{O}(\sqrt{s/n})$ for functional versions of the lasso and Dantzig selector estimators, and $\tilde{O}(\sqrt{s}/\sqrt[4]{n})$ for a functional version of the elastic net estimator. Our results match the corresponding error bounds for finite dimensional regression and improve upon CDF ridge regression which has $\tilde{O}(\sqrt{d/n})$ sample complexity. Finally, we obtain a matching information-theoretic lower bound which establishes the minimax optimality of the lasso and Dantzig selector estimators up to logarithmic factors.

URL: https://openreview.net/forum?id=AIc48TjuSt

---

Title: TIGERScore: Towards Building Explainable Metric for All Text Generation Tasks

Abstract: We present TIGERScore, a \textbf{T}rained metric that follows \textbf{I}nstruction \textbf{G}uidance to perform \textbf{E}xplainable, and \textbf{R}eference-free evaluation over a wide spectrum of text generation tasks.
Different from other automatic evaluation methods that only provide arcane scores, TIGERScore is guided by natural language instructions to provide error analysis that pinpoints the mistakes in the generated text. Our metric is based on LLaMA-2, trained on our meticulously curated instruction-tuning dataset MetricInstruct, which covers 6 text generation tasks and 23 text generation datasets. The dataset consists of 42K quadruples in the form of (instruction, input, system output $\rightarrow$ error analysis). We collected the `system outputs' from a large variety of models to cover different types of errors.
To quantitatively assess our metric, we evaluate its correlation with human ratings on 5 held-in datasets and 2 held-out datasets, and show that TIGERScore achieves the open-source SoTA correlation with human ratings across these datasets and comes close to the GPT-4 evaluator. As a reference-free metric, its correlation can even surpass the best existing reference-based metrics. To further qualitatively assess the rationales generated by our metric, we conduct a human evaluation of the generated explanations and find that the explanations are 70.8\% accurate.
Through these experimental results, we believe TIGERScore demonstrates the possibility of building universal explainable metrics to evaluate any text generation task.

URL: https://openreview.net/forum?id=EE1CBKC0SZ

---

Title: SA-MLP: Distilling Graph Knowledge from GNNs into Structure-Aware MLP

Abstract: The recursive node fetching and aggregation in message-passing cause inference latency when deploying Graph Neural Networks (GNNs) to large-scale graphs.
One promising inference acceleration direction is to distill GNNs into message-passing-free student Multi-Layer Perceptrons (MLPs).
However, the MLP student without graph dependency cannot fully learn the structure knowledge from GNNs, which causes inferior performance in heterophilic and online scenarios.
To address this problem, we first design a simple yet effective Structure-Aware MLP (SA-MLP) as a student model. It utilizes linear layers as encoders and decoders to capture features and graph structures without message-passing among nodes.
Furthermore, we introduce a novel structure-mixing knowledge distillation technique. It generates virtual samples imbued with a hybrid of structure knowledge from teacher GNNs, thereby enhancing the learning ability of MLPs for structure information.
Extensive experiments on eight benchmark datasets under both transductive and online settings show that our SA-MLP can consistently achieve similar or even better results than teacher GNNs while maintaining as fast inference speed as MLPs.
Our findings reveal that SA-MLP efficiently assimilates graph knowledge through distillation from GNNs in an end-to-end manner, eliminating the need for complex model architectures and preprocessing of features/structures.

URL: https://openreview.net/forum?id=MZ2kKZc8m7

---

Title: Input Normalized Stochastic Gradient Descent Training for Deep Neural Networks

Abstract: In this paper, we propose a novel optimization algorithm for training machine learning models called Input Normalized Stochastic Gradient Descent (INSGD), inspired by the Normalized Least Mean Squares (NLMS) algorithm used in adaptive filtering. When training complex models on large datasets, the choice of optimizer parameters, particularly the learning rate, is crucial to avoid divergence. Our algorithm updates the network weights using stochastic gradient descent with l1- and l2-based normalizations applied to the learning rate, similar to NLMS. However, unlike existing normalization methods, we exclude the error term from the normalization process and instead normalize the update term using the input vector to the neuron. Our experiments demonstrate that our optimization algorithm achieves higher accuracy levels compared to different initialization settings. We evaluate the efficiency of our training algorithm on benchmark datasets using ResNet-20, Vision Transformer, MobileNetV3, WResNet-18, ResNet-50, and a toy neural network. Our INSGD algorithm improves the mean accuracy of ResNet-20 on CIFAR-10 from 92.57% to 92.67%, the accuracy of MobileNetV3 on CIFAR-10 from 90.83% to 91.13%, WResNet-18 on CIFAR-100 from 78.24% to 78.47%, and ResNet-50 on ImageNet-1K from 75.60% to 75.92%.
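
Our reading of the core update is an NLMS-style step in which the gradient for a layer is divided by the squared norm of that layer's input rather than by an error term; the single-layer sketch below is schematic (the learning rate and epsilon are hypothetical values of ours, not the paper's settings).

    # Hedged sketch: an NLMS-style, input-normalized SGD step for one linear layer.
    # The update is divided by the squared norm of the layer's input vector (no
    # error term in the normalization); the constants are illustrative only.
    import numpy as np

    def insgd_like_step(W, x, grad_out, lr=0.1, eps=1e-8):
        """W: (out, in) weights; x: layer input; grad_out: gradient w.r.t. the layer output."""
        grad_W = np.outer(grad_out, x)            # standard SGD gradient for y = W @ x
        return W - lr * grad_W / (eps + x @ x)    # l2-based input normalization

    rng = np.random.default_rng(0)
    W = rng.normal(size=(3, 4))
    W = insgd_like_step(W, x=rng.normal(size=4), grad_out=rng.normal(size=3))
    print(W.shape)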

URL: https://openreview.net/forum?id=5TaBxctwRZ

---

Title: iHyperTime: Interpretable Time Series Generation with Implicit Neural Representations

Abstract: Implicit neural representations (INRs) have emerged as a powerful tool that provides an accurate and resolution-independent encoding of data. Their robustness as general approximators has been shown across diverse data modalities, such as images, video, audio, and 3D scenes. However, little attention has been given to leveraging these architectures for time series data. Addressing this gap, we propose an approach for time series generation based on two novel architectures: TSNet, an INR network for interpretable trend-seasonality time series representation, and iHyperTime, a hypernetwork architecture that leverages TSNet for time series generalization and synthesis. Through evaluations of fidelity and usefulness metrics, we demonstrate that iHyperTime outperforms current state-of-the-art methods in challenging scenarios that involve long or irregularly sampled time series, while performing on par on regularly sampled data. Furthermore, we showcase iHyperTime's fast training speed, comparable to the fastest existing methods for short sequences and significantly superior for longer ones. Finally, we empirically validate the quality of the model's unsupervised trend-seasonality decomposition by comparing against the well-established STL method.

URL: https://openreview.net/forum?id=GSnGPgeoS5

---

Title: Understanding Compositionality in Data Embeddings

Abstract: Embeddings are used in AI to represent symbolic structures such as knowledge graphs. However, the representations obtained cannot be directly interpreted by humans, and may further contain unintended information. We investigate how data embeddings might incorporate such information, despite that information not being used during the training process. We introduce two methods: (1) Correlation-based Compositionality Detection, which measures correlation between known attributes and embeddings, and (2) Additive Compositionality Detection, a process of decomposing embeddings into an additive composition of individual vectors representing attributes. We apply our methods across two domains: word or sentence embeddings and knowledge graph embeddings. We show that word embeddings can be interpreted as composed of semantic and morphological information, and that sentence embeddings can be interpreted as the sum of individual word embeddings. In the domain of knowledge graph embeddings, our methods show that attributes of graph nodes can be inferred, even when these attributes are not used in training the embeddings. Our methods are an improvement over previous approaches for decomposing embeddings in that our methods are 1) more general: they can be applied to multiple embedding types; 2) provide quantitative information about the decomposition; and 3) provide a statistically robust metric for determining the decomposition of an embedding.
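
As a rough illustration of additive compositionality detection (our own toy construction, not the paper's protocol), the sketch below builds embeddings that are additive by design, recovers per-attribute vectors by least squares, and checks the reconstruction residual.

    # Hedged sketch: decompose embeddings into an additive composition of
    # attribute vectors via least squares, on synthetic data built to be additive.
    import numpy as np

    rng = np.random.default_rng(0)
    n, dim, n_attrs = 500, 32, 6
    attr_vecs = rng.normal(size=(n_attrs, dim))               # hidden attribute vectors
    A = rng.integers(0, 2, size=(n, n_attrs)).astype(float)   # attribute indicators
    E = A @ attr_vecs + 0.05 * rng.normal(size=(n, dim))      # observed embeddings

    recovered = np.linalg.lstsq(A, E, rcond=None)[0]          # per-attribute vectors
    rel_err = np.linalg.norm(E - A @ recovered) / np.linalg.norm(E)
    print("relative reconstruction error:", rel_err)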

URL: https://openreview.net/forum?id=WN9zvxi1Cr

---

Title: MaskMA: Towards Zero-Shot Multi-Agent Decision Making with Mask-Based Collaborative Learning

Abstract: Building a single generalist agent with strong zero-shot capability has recently sparked significant advancements. However, extending this capability to multi-agent decision making scenarios presents challenges. Most current works struggle with zero-shot transfer, due to two challenges particular to the multi-agent settings: (a) a mismatch between centralized training and decentralized execution; and (b) difficulties in creating generalizable representations across diverse tasks due to varying agent numbers and action spaces. To overcome these challenges, we propose a Mask-Based collaborative learning framework for Multi-Agent decision making (MaskMA). Firstly, we randomly mask part of the units and collaboratively learn the policies of unmasked units to handle the mismatch. In addition, MaskMA integrates a generalizable action representation by dividing the action space into intrinsic actions solely related to the unit itself and interactive actions involving interactions with other units. This flexibility allows MaskMA to tackle tasks with varying agent numbers and thus different action spaces. Extensive experiments in SMAC reveal MaskMA, with a single model trained on 11 training maps, can achieve an impressive 77.8% average zero-shot win rate on 60 unseen test maps by decentralized execution, while also performing effectively on other types of downstream tasks (e.g., varied policies collaboration, ally malfunction, and ad hoc team play).

URL: https://openreview.net/forum?id=Susy8EAff9

---

Title: Zero-Order One-Point Estimate with Distributed Stochastic Gradient-Tracking Technique

Abstract: In this work, we consider a distributed multi-agent stochastic optimization problem, where each agent holds a local objective function that is smooth and convex and that is subject to a stochastic process. The goal is for all agents to collaborate to find a common solution that optimizes the sum of these local functions. With the practical assumption that agents can only obtain noisy numerical function queries at precisely one point at a time, we extend the distributed stochastic gradient-tracking method to the bandit setting where we do not have an estimate of the gradient, and we introduce a zero-order (ZO) one-point estimate (1P-DSGT). We analyze the convergence of this novel technique for smooth and convex objectives using stochastic approximation tools, and we prove that it \textit{converges almost surely to the optimum} despite the biasedness of our gradient estimate. We then study the convergence rate for when the objectives are additionally strongly convex. With constant step sizes, our method competes with its first-order (FO) counterparts by achieving a linear rate $O(\varrho^k)$ as a function of number of iterations $k$. To the best of our knowledge, this is the first work that proves this rate in the noisy estimation setting or with one-point estimators. With vanishing step sizes, we establish a rate of $O(\frac{1}{\sqrt{k}})$ after a sufficient number of iterations $k > K_2$. This is the optimal rate proven in the literature for centralized techniques utilizing one-point estimators. We then provide a regret bound of $O(\sqrt{k})$ with vanishing step sizes. We further illustrate the usefulness of the proposed technique using numerical experiments.
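
The one-point estimator at the heart of the method has a compact generic form, $g = (d/\gamma)\, f(x + \gamma u)\, u$ for a random unit direction $u$; the sketch below (with a quadratic objective and constants chosen by us) averages many such single-point queries and compares the result to the true gradient.

    # Hedged sketch: the one-point zero-order estimate g = (d/gamma) * f(x+gamma*u) * u,
    # averaged over many single-point queries and compared to the true gradient.
    import numpy as np

    rng = np.random.default_rng(0)
    d, gamma = 5, 0.2
    x = rng.normal(size=d)

    def f(z):                                      # smooth convex objective + query noise
        return 0.5 * np.sum(z ** 2) + 0.01 * rng.normal()

    estimates = []
    for _ in range(200_000):
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)                     # random direction on the unit sphere
        estimates.append((d / gamma) * f(x + gamma * u) * u)

    print("true gradient      :", np.round(x, 2))  # gradient of 0.5*||x||^2 is x itself
    print("one-point estimate :", np.round(np.mean(estimates, axis=0), 2))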

URL: https://openreview.net/forum?id=yAoavrPtBq

---

Title: Finite-Time Analysis of Temporal Difference Learning with Experience Replay

Abstract: Temporal-difference (TD) learning is widely regarded as one of the most popular algorithms in reinforcement learning (RL). Despite its widespread use, it has only been recently that researchers have begun to actively study its finite time behavior, including the finite time bound on mean squared error and sample complexity. On the empirical side, experience replay has been a key ingredient in the success of deep RL algorithms, but its theoretical effects on RL have yet to be fully understood. In this paper, we present a simple decomposition of the Markovian noise terms and provide finite-time error bounds for tabular on-policy TD-learning with experience replay. Specifically, under the Markovian observation model, we demonstrate that for both the averaged iterate and final iterate cases, the error term induced by a constant step-size can be effectively controlled by the size of the replay buffer and the mini-batch sampled from the experience replay buffer.
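
To make the analyzed setting concrete, here is a minimal tabular on-policy TD(0) loop with an experience replay buffer on a toy Markov reward process; the buffer, batch, and step sizes are arbitrary choices of ours, not those studied in the paper.

    # Hedged sketch: tabular on-policy TD(0) with an experience replay buffer on a
    # small synthetic Markov reward process.
    import numpy as np

    rng = np.random.default_rng(0)
    S, gamma, alpha = 5, 0.9, 0.05
    P = rng.dirichlet(np.ones(S), size=S)                # row-stochastic transitions
    r = rng.normal(size=S)                               # reward for leaving each state

    V_true = np.linalg.solve(np.eye(S) - gamma * P, r)   # exact value function
    V, buffer, s = np.zeros(S), [], 0

    for t in range(10000):
        s_next = rng.choice(S, p=P[s])
        buffer.append((s, r[s], s_next))                 # store the observed transition
        s = s_next
        for i in rng.integers(0, len(buffer), size=16):  # mini-batch sampled from replay
            bs, br, bs2 = buffer[i]
            V[bs] += alpha * (br + gamma * V[bs2] - V[bs])

    print("max abs value error:", np.max(np.abs(V - V_true)))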

URL: https://openreview.net/forum?id=A5ulGfDBON

---

Title: Are Large Language Models Really Robust to Word-Level Perturbations?

Abstract: The swift advancement in the scale and capabilities of Large Language Models (LLMs) positions them as promising tools for a variety of downstream tasks. In addition to the pursuit of better performance and the avoidance of violent responses to certain prompts, much attention has been drawn to the robustness of LLMs in order to ensure their responsible use. However, existing evaluation methods mostly rely on traditional question answering datasets with predefined supervised labels, potentially ignoring the superior generation capabilities of contemporary LLMs. To investigate the robustness of LLMs while using their generation ability, we propose a novel rational evaluation pipeline that leverages reward models as diagnostic tools to evaluate long conversations generated by LLMs from more challenging open questions, which we refer to as the Reward Model for Reasonable Robustness Evaluation (TREvaL). Longer conversations manifest the comprehensive grasp of language models in terms of their proficiency in understanding questions, a capability not entirely captured by individual words or letters. Our extensive empirical experiments demonstrate that TREvaL reveals a lack of robustness in today's LLMs. Notably, we are surprised to discover that robustness tends to decrease as fine-tuning (SFT and RLHF) is conducted, calling for more attention to robustness during the alignment process.

URL: https://openreview.net/forum?id=BMKJEGNMcZ

---

Title: Stochastic Fractional Gradient Descent with Caputo $L_1$ Scheme for Deep Neural Networks

Abstract: Stochastic gradient descent (SGD) has been used as a standard method to optimize deep neural networks (DNNs), where it essentially deals with first-order derivatives. Incorporating fractional derivatives into learning algorithms is expected to improve model performance, especially when the corresponding optimization problems involve objective functions with memory effects or long-range dependencies. The Caputo derivative is a fractional derivative that maintains consistency with integer-order calculus and produces more reliable solutions than other fractional derivatives, especially for differential equations. In this paper, we propose a novel Caputo-based SGD algorithm tailored for training DNNs. Our method exploits the Caputo $L_1$ scheme to achieve highly effective training and accurate prediction for large data by using gradient information from its past history to guide parameter updates in a more informed direction. This allows it to avoid local minima and saddle points, resulting in faster convergence to the target value. We conducted experiments on several benchmark datasets to evaluate our method. The results show that our method can improve the empirical performance over some traditional optimization methods in both accuracy and convergence.
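
As one possible picture of how an L1-type Caputo discretization introduces gradient memory (the precise update in the paper may well differ), the sketch below weights a window of past gradients with the standard L1-scheme coefficients $b_j = (j+1)^{1-\alpha} - j^{1-\alpha}$; all constants and the toy objective are our own choices.

    # Hedged sketch: an SGD-like step that aggregates past gradients with L1-scheme
    # weights b_j = (j+1)^(1-alpha) - j^(1-alpha) for a Caputo order alpha in (0,1).
    # How exactly these weights enter the paper's update may differ from this toy.
    import numpy as np

    def l1_weights(n, alpha):
        j = np.arange(n)
        return (j + 1) ** (1 - alpha) - j ** (1 - alpha)    # decaying memory weights

    rng = np.random.default_rng(0)
    alpha, lr, window = 0.7, 0.1, 20
    w, history = rng.normal(size=5), []

    for step in range(200):
        history.append(w.copy())                            # gradient of 0.5*||w||^2 is w
        g_hist = np.array(history[-window:][::-1])          # most recent gradient first
        b = l1_weights(len(g_hist), alpha)
        w = w - lr * (b[:, None] * g_hist).sum(axis=0) / b.sum()

    print("||w|| after fractional-memory updates:", np.linalg.norm(w))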

URL: https://openreview.net/forum?id=hCGaySEW9q

---

Title: Active Learning via Classifier Impact and Greedy Selection for Interactive Image Retrieval

Abstract: Active Learning (AL) is a user-interactive approach aimed at reducing annotation costs by selecting the most crucial examples to label. Although AL has been extensively studied for image classification tasks, the specific scenario of interactive image retrieval has received relatively little attention. This scenario presents unique characteristics, including an open-set and class-imbalanced binary classification, starting with very few labeled samples. We introduce a novel batch-mode Active Learning framework named GAL (Greedy Active Learning) that better copes with this application. It incorporates a new acquisition function for sample selection that measures the impact of each unlabeled sample on the classifier. We further embed this strategy in a greedy selection approach, better exploiting the samples within each batch. We evaluate our framework with both linear (SVM) and non-linear MLP/Gaussian Process classifiers. For the Gaussian Process case, we show a theoretical guarantee on the greedy approximation. Finally, we assess our performance for the interactive content-based image retrieval task on several benchmarks and demonstrate its superiority over existing approaches and common baselines.

URL: https://openreview.net/forum?id=b68QOenPWy

---

Title: Boosting Visual-Language Models by Exploiting Hard Pairs

Abstract: Contrastive Language-Image Pre-training (CLIP) has become the standard for learning cross-modal representations between images and text. Efforts to improve its capabilities typically demand the collection of additional data and retraining with new loss functions. While effective, the added requirements limit their practical use due to the increased resource and time investments needed. In this work, we present Helip, a cost-effective strategy tailored to enhance the performance of existing CLIP models without the need for training a model from scratch or collecting additional data. Our method allows for effortless integration with existing models’ training pipelines, providing an instant boost by training them with selected challenging text-image pairs from their original training datasets. Helip treats each text-image pair as a single point in the joint vision-language space, identifying those in close proximity as hard pairs. By incorporating the challenging data, pre-trained CLIP models are refined using both the traditional contrastive loss and the newly introduced hard negative margin loss, ensuring the challenging data is fully utilized. On comprehensive benchmarks, Helip consistently boosts existing models to achieve leading performance. In particular, it improves the zero-shot classification accuracy on ImageNet for SLIP models pre-trained on CC3M, CC12M and YFCC15M datasets. The improvements are 3.05%, 4.47%, and 10.1% respectively, achieved within two epochs of training. In addition, across fine-grained classification datasets, Helip improves the zero-shot performance of pre-trained CLIP and SLIP by an average of 8.4% and 18.6%, and their linear probe performance by an average of 9.5% and 3.0%. The code is publicly available at https://anonymous.4open.science/r/HELIP-7F8E/.

URL: https://openreview.net/forum?id=41ZFz7EZzX

---

Title: Federated Learning with Reduced Information Leakage and Computation

Abstract: Federated learning (FL) is a distributed learning paradigm that allows multiple decentralized clients to collaboratively learn a common model without sharing local data. Although local data is not exposed directly, privacy concerns nonetheless exist as clients' sensitive information can be inferred from intermediate computations. Moreover, such information leakage accumulates substantially over time as the same data is repeatedly used during the iterative learning process. As a result, it can be particularly difficult to balance the privacy-accuracy trade-off when designing privacy-preserving FL algorithms. This paper introduces Upcycled-FL, a simple yet effective strategy that applies first-order approximation at every even round of model update. Under this strategy, half of the FL updates incur no information leakage and require much less computational and transmission costs. We first conduct the theoretical analysis on the convergence (rate) of Upcycled-FL and then apply two perturbation mechanisms to preserve privacy.
Extensive experiments on both synthetic data and real-world data show that the Upcycled-FL strategy can be adapted to many existing FL frameworks and consistently improve the privacy-accuracy trade-off.

URL: https://openreview.net/forum?id=ZJ4A3xhADV

---

Title: Correcting Flaws in Common Disentanglement Metrics

Abstract: Disentangled representations are those in which distinct features, such as size or shape, are represented by distinct neurons. Quantifying the extent to which a given representation is disentangled is not straightforward; multiple metrics have been proposed. In this paper, we identify two failings of existing metrics, which mean they can assign a high score to a model which is still entangled, and we propose two new metrics, which redress these problems. First, we use hypothetical toy examples to demonstrate the failure modes we identify for existing metrics. Then, we show that similar situations occur in practice. Finally, we validate our metrics on the downstream task of compositional generalization. We measure the performance of six existing disentanglement models on this downstream compositional generalization task, and show that performance is (a) generally quite poor, (b) correlated, to varying degrees, with most disentanglement metrics, and (c) most strongly correlated with our newly proposed metrics. Anonymous code to reproduce our results is available at https://github.com/anon296/anon.

URL: https://openreview.net/forum?id=c8WJ4Vozb2

---

Title: Estimating class separability of text embeddings with persistent homology.

Abstract: This paper introduces an unsupervised method to estimate the class separability of text datasets from a topological point of view. Using persistent homology, we demonstrate how tracking the evolution of embedding manifolds during training can inform about class separability. More specifically, we show how this technique can be applied to detect when the training process stops improving the separability of the embeddings. Our results, validated across binary and multi-class text classification tasks, show that the proposed method’s estimates of class separability align with those obtained from supervised methods. This approach offers a novel perspective on monitoring and improving the fine-tuning of sentence transformers for classification tasks, particularly in scenarios where labeled data is scarce. We also discuss how tracking these quantities can provide additional insights into the properties of the trained classifier.

URL: https://openreview.net/forum?id=8DWrIMuLya

---

Title: Learning Counterfactually Invariant Predictors

Abstract: Notions of counterfactual invariance (CI) have proven essential for predictors that are fair, robust, and generalizable in the real world. We propose graphical criteria that yield a sufficient condition for a predictor to be counterfactually invariant in terms of a conditional independence in the observational distribution. In order to learn such predictors, we propose a model-agnostic framework, called Counterfactually Invariant Prediction (CIP), building on the Hilbert-Schmidt Conditional Independence Criterion (HSCIC), a kernel-based conditional dependence measure. Our experimental results demonstrate the effectiveness of CIP in enforcing counterfactual invariance across various simulated and real-world datasets including scalar and multi-variate settings.

URL: https://openreview.net/forum?id=pRt1Vw1DPs

---

Title: Hierarchical Prototype-based Explanations

Abstract: To interpret deep neural networks, one main approach is to dissect the visual input and find the prototypical parts responsible for the classification. However, existing methods often ignore the hierarchical relationship between these prototypes, and thus cannot explain semantic concepts at both a higher level (e.g., water sports) and a lower level (e.g., swimming). In this paper, inspired by the human cognition system, we leverage hierarchical information to deal with uncertainty: when we observe water and human activity but no definitive action, the activity can be recognized as belonging to the water-sports parent class. Only after observing a person swimming can we definitively refine it to the swimming action. To this end, we propose the HIerarchical Prototype Explainer (HIPE) to build hierarchical relations between prototypes and classes. HIPE enables a reasoning process by dissecting the input video frames on multiple levels of the class hierarchy. The faithfulness of our method is verified by reducing the accuracy-explainability trade-off on ActivityNet and UCF-101 while providing multi-level explanations.

URL: https://openreview.net/forum?id=sLb02HyABv

---

Title: Switching Latent Bandits

Abstract: We consider a Latent Bandit problem where the latent state keeps changing in time according to an underlying Markov chain, and every state is represented by a specific Bandit instance. At each step, the agent chooses an arm and observes a random reward but is unaware of which MAB it is currently pulling. As is typical in Latent Bandits, we assume that the reward distributions of the arms of all the Bandit instances are known. Within this setting, our goal is to learn the transition matrix determined by the Markov process.
We propose a technique to tackle this estimation problem that results in solving a least-squares problem obtained by exploiting the knowledge of the reward distributions and the properties of Markov chains. We prove the consistency of the estimation procedure, and we make a theoretical comparison with standard Spectral Decomposition techniques. We then discuss the dependency of the problem on the number of arms and present an offline method that chooses the best subset of possible arms that can be used for the estimation of the transition model. We ultimately introduce the SL-EC algorithm based on an Explore then Commit strategy that uses the proposed approach to estimate the transition model during the exploration phase. This algorithm achieves a regret of the order $\widetilde{\mathcal{O}}(T^{2/3})$ when compared against an oracle that builds a belief representation of the current state using the knowledge of both the observation and transition model and optimizes the expected instantaneous reward at each step. Finally, we illustrate the effectiveness of the approach and compare it with state-of-the-art algorithms for non-stationary bandits and with a modified technique based on spectral decomposition.

URL: https://openreview.net/forum?id=4ZGqCXcUqR

---

Title: Uncertainty in Graph Neural Networks: A Survey

Abstract: Graph Neural Networks (GNNs) have been extensively used in various real-world applications. However, the predictive uncertainty of GNNs stemming from diverse sources such as inherent randomness in data and model training errors can lead to unstable and erroneous predictions. Therefore, identifying, quantifying, and utilizing uncertainty are essential to enhance the performance of the model for the downstream tasks as well as the reliability of the GNN predictions. This survey aims to provide a comprehensive overview of the GNNs from the perspective of uncertainty with an emphasis on its integration in graph learning. We compare and summarize existing graph uncertainty theory and methods, alongside the corresponding downstream tasks. Thereby, we bridge the gap between theory and practice, meanwhile connecting different GNN communities. Moreover, our work provides valuable insights into promising directions in this field.

URL: https://openreview.net/forum?id=0e1Kn76HM1

---

Title: Directed Graph Transformers

Abstract: In this paper, we address the problem of capturing graph directionality using transformers. Most existing graph transformers typically capture distances between graph nodes and do not take edge direction into account. This is a limiting assumption since many graph applications need to exploit sophisticated relationships in graph data, such as time, causality, or generic dependency constraints. We introduce a novel graph transformer architecture that explicitly takes into account the directionality between connected graph nodes. To achieve this, we make use of dual encodings to represent both potential roles, i.e., source or target, of each pair of vertices linked by a directed edge. These encodings are learned by leveraging the latent adjacency information extracted from a directional attention module, localized with $k$-hop neighborhood information. Extensive experiments on synthetic and real graph datasets show that our approach can have significant accuracy gains over previous graph transformer (GT) and graph neural network (GNN) approaches, providing state-of-the-art (SOTA) results on inherently directed graphs.

URL: https://openreview.net/forum?id=otTFPjziiK

---

Title: Bytes Are All You Need: Transformers Operating Directly On File Bytes

Abstract: Modern deep learning approaches usually utilize modality-specific processing. For example, the most common deep learning approach to image classification involves decoding image file bytes into an RGB tensor which is passed into a neural network. Instead, we investigate \textit{modality-independent} representation learning by performing classification directly on file bytes, without the need for decoding files at inference time. This enables models to operate on various modalities without any hand-designed, modality-specific processing. Our model, \emph{ByteFormer}, improves ImageNet Top-1 classification accuracy by $5\%$ (from $72.2\%$ to $77.33\%$) relative to DeIT models of similar size. Compared to Perceiver IO, our model requires absolutely no modality-specific processing at inference time, and uses an order of magnitude fewer parameters at equivalent accuracy on ImageNet. We demonstrate that the same ByteFormer architecture can perform audio classification without modifications or modality-specific preprocessing. We achieve $95.42\%$ classification accuracy on the Speech Commands V2 dataset (comparable to the state-of-the-art accuracy of $98.7\%$). Additionally, we demonstrate that ByteFormer can operate jointly on images and audio, handling joint classification without explicit knowledge of the input modality. We will open source our code.
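
The input side of such a model is easy to picture: the raw bytes of any file become a token sequence over a 256-symbol vocabulary, with no decoding step. A minimal tokenization sketch (sequence length and embedding size are hypothetical choices of ours) follows.

    # Hedged sketch: turning raw file bytes into a token sequence and embeddings,
    # the kind of modality-independent input a byte-level transformer could consume.
    import numpy as np

    def bytes_to_tokens(path, max_len=4096):
        with open(path, "rb") as f:
            raw = f.read(max_len)
        return np.frombuffer(raw, dtype=np.uint8).astype(np.int64)   # 256-symbol vocab

    vocab_size, dim = 256, 64                       # hypothetical embedding size
    embedding = np.random.default_rng(0).normal(size=(vocab_size, dim))

    tokens = bytes_to_tokens(__file__)              # any file: image, audio, text, ...
    x = embedding[tokens]                           # (sequence_length, dim) transformer input
    print(tokens.shape, x.shape)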

URL: https://openreview.net/forum?id=RkaqxxAOfN

---

Title: SPriFed-OMP: A Differentially Private Federated Learning Algorithm for Sparse Basis Recovery

Abstract: Sparse basis recovery is a classical and important statistical learning problem when the number of model dimensions $p$ is much larger than the number of samples $n$. However, there has been little work that studies sparse basis recovery in the Federated Learning (FL) setting, where the client data's differential privacy (DP) must also be simultaneously protected. In particular, the performance guarantees of existing DP-FL algorithms (such as DP-SGD) will degrade significantly when $p \gg n$, and thus, they will fail to learn the true underlying sparse model accurately. In this work, we develop a new differentially private sparse basis recovery algorithm for the FL setting, called SPriFed-OMP. SPriFed-OMP converts OMP (Orthogonal Matching Pursuit) to the FL setting. Further, it combines SMPC (secure multi-party computation) and DP to ensure that only a small amount of noise needs to be added in order to achieve differential privacy. As a result, SPriFed-OMP can efficiently recover the true sparse basis for a linear model with only $n = \mathcal{O}(\sqrt{p})$ samples. We further present an enhanced version of our approach, SPriFed-OMP-GRAD, based on gradient privatization, that improves the performance of SPriFed-OMP. Our theoretical analysis and empirical results demonstrate that both SPriFed-OMP and SPriFed-OMP-GRAD terminate in a small number of steps, and they significantly outperform the previous state-of-the-art DP-FL solutions in terms of the accuracy-privacy trade-off.
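
For context, the classical, centralized, non-private building block is orthogonal matching pursuit itself; the sketch below runs plain OMP on a synthetic sparse linear model with scikit-learn and shows the $p \gg n$ sparse-recovery problem being solved, without any of the paper's federated, DP, or SMPC machinery.

    # Hedged sketch: plain, centralized, non-private OMP on a synthetic sparse
    # linear model with p >> n; none of the paper's DP/SMPC/federated steps appear.
    import numpy as np
    from sklearn.linear_model import OrthogonalMatchingPursuit

    rng = np.random.default_rng(0)
    n, p, s = 100, 400, 5
    support = rng.choice(p, size=s, replace=False)
    w_true = np.zeros(p)
    w_true[support] = rng.uniform(1.0, 2.0, size=s) * rng.choice([-1.0, 1.0], size=s)
    X = rng.normal(size=(n, p))
    y = X @ w_true + 0.01 * rng.normal(size=n)

    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=s).fit(X, y)
    print("support recovered:", set(np.flatnonzero(omp.coef_)) == set(support))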

URL: https://openreview.net/forum?id=Dsavre6gjN

---

Title: Using Stochastic Gradient Descent to Smooth Nonconvex Functions: Analysis of Implicit Graduated Optimization with Optimal Noise Scheduling

Abstract: The graduated optimization approach is a heuristic method for finding globally optimal solutions for nonconvex functions and has been theoretically analyzed in several studies. This paper defines a new family of nonconvex functions for graduated optimization, discusses their sufficient conditions, and provides a convergence analysis of the graduated optimization algorithm for them. It shows that stochastic gradient descent (SGD) with mini-batch stochastic gradients has the effect of smoothing the function, the degree of which is determined by the learning rate and batch size. This finding provides theoretical insights on why large batch sizes fall into sharp local minima, why decaying learning rates and increasing batch sizes are superior to fixed learning rates and batch sizes, and what the optimal learning rate scheduling is. To the best of our knowledge, this is the first paper to provide a theoretical explanation for these aspects. Moreover, a new graduated optimization framework that uses a decaying learning rate and increasing batch size is analyzed and experimental results of image classification that support our theoretical findings are reported.
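
A toy, explicit version of graduated optimization (not the paper's implicit analysis of SGD) can convey the idea: smooth a nonconvex function with Gaussian noise of decreasing scale and descend the successively sharper surrogates; the function, schedule, and step size below are our own choices.

    # Hedged sketch: explicit graduated optimization on a 1-D nonconvex function.
    # The smoothing scale delta is shrunk stage by stage, playing the role the paper
    # attributes to SGD's learning rate and batch size.
    import numpy as np

    rng = np.random.default_rng(0)
    f = lambda z: z ** 2 + 2.0 * np.sin(5.0 * z)            # nonconvex, spurious minima
    true_grad = lambda z: 2 * z + 10.0 * np.cos(5.0 * z)

    def smoothed_grad(z, delta, samples=256):
        u = rng.normal(size=samples)
        return np.mean(f(z + delta * u) * u) / delta        # Gaussian-smoothing gradient

    x = 3.0
    for delta in [2.0, 1.0, 0.5, 0.25, 0.1, 0.0]:           # coarse-to-fine schedule
        for _ in range(300):
            g = smoothed_grad(x, delta) if delta > 0 else true_grad(x)
            x -= 0.02 * g
    print("final x (the global minimum is near -0.3):", round(x, 3))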

URL: https://openreview.net/forum?id=puaPiO82yI

---

Title: Referential communication in heterogeneous communities of pre-trained visual deep networks

Abstract: As large pre-trained image-processing neural networks are being embedded in autonomous agents such as self-driving cars or robots, the question arises of how such systems can communicate with each other about the surrounding world, despite their different architectures and training regimes.
As a first step in this direction, we systematically explore the task of $\textit{referential communication}$ in a community of heterogeneous state-of-the-art pre-trained visual networks, showing that they can develop, in a self-supervised way, a shared protocol to refer to a target object among a set of candidates. This shared protocol can also be used, to some extent, to communicate about previously unseen object categories of different granularity. Moreover, a visual network that was not initially part of an existing community can learn the community's protocol with remarkable ease. Finally, we study, both qualitatively and quantitatively, the properties of the emergent protocol, providing some evidence that it is capturing high-level semantic features of objects.

URL: https://openreview.net/forum?id=NbbU8zr2v4

---

Title: Masked multi-prediction for multi-aspect anomaly detection

Abstract: In this paper, we address the anomaly detection problem in the context of heterogeneous normal observations and propose an approach that accounts for this heterogeneity. Although prediction-based methods are common to learn normality, the vast majority of previous work predicts a single outcome, which is generally not sufficient to account for the multiplicity of possible normal observations. To address this issue, we introduce a new masked multi-prediction (MMP) approach that produces multiple likely normal outcomes, and show both theoretically and experimentally that it improves normality learning and leads to a better anomaly detection performance. In addition, we observed that normality can be characterized from multiple aspects, depending on the types of anomalies we would like to detect. Therefore, we propose an adaptation (MMP-AMS) of our approach to cover multiple aspects of normality such as appearance, motion, semantics and location. Since we model each aspect separately, our approach has the advantage of being interpretable and modular, as we can select only a subset of normality aspects. The experiments conducted on several benchmarks show the effectiveness of the proposed approach.

URL: https://openreview.net/forum?id=7wybYcK1pw

---

Title: ADIR: Adaptive Diffusion for Image Reconstruction

Abstract: In recent years, denoising diffusion models have demonstrated outstanding image generation performance.
The information on natural images captured by these models is useful for many image reconstruction applications,
where the task is to restore a clean image from its degraded observation. In this work, we propose a conditional sampling scheme that exploits the prior learned by diffusion models while retaining agreement with the measurements. We then combine it with a novel approach for adapting pre-trained diffusion denoising networks to their input. We perform the adaptation using images that are ``nearest neighbours'' to the degraded image, retrieved from a diverse dataset using an off-the-shelf visual-language model. To evaluate our method, we test it on two state-of-the-art publicly available diffusion models, Stable Diffusion and Guided Diffusion. We show that our proposed \textbf{A}daptive \textbf{D}iffusion for \textbf{I}mage \textbf{R}econstruction (\textbf{ADIR}) approach achieves significant improvement in image reconstruction tasks. Our code will be available online upon publication.

URL: https://openreview.net/forum?id=0dc2TtHpZB

---

Title: How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning

Abstract: Despite superior reasoning prowess demonstrated by Large Language Models (LLMs) with Chain-of-Thought (CoT) prompting, a lack of understanding prevails around the internal mechanisms of the models that facilitate CoT generation. This work investigates the neural sub-structures within LLMs that manifest CoT reasoning from a mechanistic point of view. From an analysis of Llama-2 7B applied to multistep reasoning over fictional ontologies, we demonstrate that LLMs deploy multiple parallel pathways of answer generation for step-by-step reasoning. These parallel pathways provide sequential answers from the input question context as well as the generated CoT. We observe a functional rift in the middle layers of the LLM. Token representations in the initial half remain strongly biased towards the pretraining prior, with the in-context prior taking over in the later half. This internal phase shift manifests in different functional components: attention heads that write the answer token appear in the later half, attention heads that move information along ontological relationships appear in the initial half, and so on. To the best of our knowledge, this is the first attempt towards mechanistic investigation of CoT reasoning in LLMs.

URL: https://openreview.net/forum?id=uHLDkQVtyC

---

Title: CoMIX: A Multi-agent Reinforcement Learning Training Architecture for Efficient Decentralized Coordination and Independent Decision-Making

Abstract: Robust coordination skills enable agents to operate cohesively in shared environments, together towards a common goal and, ideally, individually without hindering each other's progress. To this end, this paper presents Coordinated QMIX (CoMIX), a novel training framework for decentralized agents that enables emergent coordination through flexible policies while at the same time allowing independent decision-making at the individual level. CoMIX models selfish and collaborative behavior as incremental steps in each agent's decision process. This allows agents to dynamically adapt their behavior to different situations, balancing independence and collaboration. Experiments using a variety of simulation environments demonstrate that CoMIX outperforms baselines on collaborative tasks. The results validate our incremental approach as an effective technique for improving coordination in multi-agent systems.

URL: https://openreview.net/forum?id=JoU9khOwwr

---

Title: Towards Certainty: Exploiting Monotonicity with Fast Marching Methods to Reduce Predictive Uncertainty

Abstract: In recent years, neural networks have achieved impressive performance on a wide range of tasks. However, neural networks tend to make overly optimistic predictions about out-of-distribution data. When managing model risks, it is important to know what we do not know. Although there have been many successes in detecting out-of-distribution data, it is unclear how we can extract further information from these uncertain predictions. To address this problem, we propose to exploit three types of monotonicity by solving a mean-variance optimization problem. The fast marching method is proposed as an efficient solution. We demonstrate, using empirical examples, that it is possible to provide confident bounds for a large portion of uncertain predictions by exploiting monotonicity.

URL: https://openreview.net/forum?id=SFa93igP8G

---

Title: Can LLMs Effectively Leverage Graph Structural Information through Prompts, and Why?

Abstract: Large language models (LLMs) are gaining increasing attention for their capability to process graphs with rich text attributes, especially in a zero-shot fashion. Recent studies demonstrate that LLMs obtain decent text classification performance on common text-rich graph benchmarks, and the performance can be improved by appending encoded structural information as natural languages into prompts. We aim to understand why the incorporation of structural information inherent in graph data can improve the prediction performance of LLMs. First, we rule out the concern of data leakage by curating a novel leakage-free dataset and conducting a comparative analysis alongside a previously widely-used dataset. Second, as past work usually encodes the ego-graph by describing the graph structure in natural language, we ask the question: do LLMs understand the prompts in graph structures? Third, we investigate why LLMs can improve their performance after incorporating structural information.
Our exploration of these questions reveals that (i) there is no substantial evidence that the performance of LLMs is significantly attributed to data leakage; (ii) instead of understanding prompts as graph structures, LLMs tend to process prompts more as contextual paragraphs and (iii) the most efficient elements of the local neighborhood included in the prompt are phrases that are pertinent to the node label, rather than the graph structure.
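
As an illustration of how an ego-graph can be encoded as natural language before being handed to an LLM (a generic, hypothetical template of ours, not the exact prompts used in the studies discussed), consider:

    # Hedged sketch: serializing a node's ego-graph into a natural-language prompt.
    # The template, node texts, and question are hypothetical.
    def ego_graph_prompt(node, neighbors, texts):
        lines = [f"Target paper: {texts[node]}",
                 "It is connected to the following papers:"]
        lines += [f"- {texts[n]}" for n in neighbors]
        lines.append("Question: what is the category of the target paper? Answer briefly.")
        return "\n".join(lines)

    texts = {
        0: "Attention Is All You Need",
        1: "BERT: Pre-training of Deep Bidirectional Transformers",
        2: "ImageNet Classification with Deep Convolutional Neural Networks",
    }
    print(ego_graph_prompt(0, neighbors=[1, 2], texts=texts))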

URL: https://openreview.net/forum?id=L2jRavXRxs

---

Title: Decoupling Pixel Flipping and Occlusion Strategy for Consistent XAI Benchmarks

Abstract: Feature removal is a central building block for eXplainable AI (XAI), both for occlusion-based explanations (Shapley values) as well as their evaluation (pixel flipping, PF).
However, occlusion strategies can vary significantly from simple mean replacement up to inpainting with state-of-the-art diffusion models.
This ambiguity limits the usefulness of occlusion-based approaches.
For example, PF benchmarks lead to contradicting rankings.
This is amplified by competing PF measures: Features are either removed starting with most influential first (MIF) or least influential first (LIF).

This study proposes two complementary perspectives to resolve this disagreement problem.
Firstly, we address the common criticism of occlusion-based XAI, that artificial samples lead to unreliable model evaluations.
We propose to measure the reliability by the R(eference)-Out-of-Model-Scope (OMS) score.
The R-OMS score enables a systematic comparison of occlusion strategies and resolves the disagreement problem by grouping consistent PF rankings.
Secondly, we show that the insightfulness of MIF and LIF is conversely dependent on the R-OMS score.
To leverage this, we combine the MIF and LIF measures into the symmetric relevance gain (SRG) measure.
This breaks the inherent connection to the underlying occlusion strategy and leads to consistent rankings.
This resolves the disagreement problem of PF benchmarks, which we verify for a set of 40 different occlusion strategies.
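
To fix ideas, the sketch below computes a basic pixel-flipping curve under a simple zero-replacement occlusion strategy, removing features either most-influential-first (MIF) or least-influential-first (LIF) for a toy linear model; the paper's R-OMS and SRG measures are not reproduced here.

    # Hedged sketch: a basic pixel-flipping (PF) curve with zero-value occlusion
    # for a toy linear model, removing features MIF or LIF.
    import numpy as np

    rng = np.random.default_rng(0)
    d = 64
    w = rng.normal(size=d)
    model = lambda z: z @ w                        # toy linear "classifier" score
    x = rng.normal(size=d)
    attribution = w * x                            # exact relevances for a linear model
    baseline = np.zeros(d)                         # occlusion strategy: replace with zero

    def pf_curve(order):
        z, scores = x.copy(), []
        for i in order:
            z[i] = baseline[i]                     # flip one feature at a time
            scores.append(model(z))
        return np.array(scores)

    mif = pf_curve(np.argsort(-attribution))       # most influential first
    lif = pf_curve(np.argsort(attribution))        # least influential first
    print("mean score along MIF curve:", round(mif.mean(), 3),
          "| LIF curve:", round(lif.mean(), 3))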

URL: https://openreview.net/forum?id=bIiLXdtUVM

---

Title: Feedback-guided Data Synthesis for Imbalanced Classification

Abstract: The current status quo in machine learning is to use static datasets of real images for training, which often come from long-tailed distributions. With the recent advances in generative models, researchers have started augmenting these static datasets with synthetic data, reporting moderate performance improvements on classification tasks. We hypothesize that these performance gains are limited by the lack of feedback from the classifier to the generative model, which would promote the usefulness of the generated samples to improve the classifier's performance. In this work, we introduce a framework for augmenting static datasets with useful synthetic samples, which leverages one-shot feedback from the classifier to drive the sampling of the generative model. In order for the framework to be effective, we find that the samples must be close to the support of the real data of the task at hand, and be sufficiently diverse. We validate three feedback criteria on long-tailed datasets (ImageNet-LT, Places-LT) as well as a group-imbalanced dataset (NICO++). On ImageNet-LT, we achieve state-of-the-art results, with over $4\%$ improvement on underrepresented classes while being twice as efficient in terms of the number of generated synthetic samples. Similarly, on Places-LT we achieve state-of-the-art results as well as nearly $4\%$ improvement on underrepresented classes. NICO++ also enjoys marked boosts of over $5\%$ in worst-group accuracy. With these results, our framework paves the path towards effectively leveraging state-of-the-art text-to-image models as data sources that can be queried to improve downstream applications.

URL: https://openreview.net/forum?id=IHJ5OohGwr

---

Title: Causal Reasoning and Large Language Models: Opening a New Frontier for Causality

Abstract: The causal capabilities of large language models (LLMs) are a matter of significant debate, with critical implications for the use of LLMs in societally impactful domains such as medicine, science, law, and policy. We conduct a “behavioral” study of LLMs to benchmark their capability in generating causal arguments. Across a wide range of tasks, we find that LLMs can generate text corresponding to correct causal arguments with high probability, surpassing the best-performing existing methods. Algorithms based on GPT-3.5 and 4 outperform existing algorithms on a pairwise causal discovery task (97%, 13 points gain), a counterfactual reasoning task (92%, 20 points gain) and event causality (86% accuracy in determining necessary and sufficient causes in vignettes). We perform robustness checks across tasks and show that the capabilities cannot be explained by dataset memorization alone.
That said, LLMs exhibit unpredictable failure modes and we discuss the kinds of errors that may be improved and what are the fundamental limits of LLM-based answers. Overall, by operating on the text metadata, LLMs bring capabilities so far understood to be restricted to humans, such as using collected knowledge to generate causal graphs or identifying background causal context from natural language. As a result, LLMs may be used by human domain experts to save effort in setting up a causal analysis, one of the biggest impediments to the widespread adoption of causal methods. Given that LLMs ignore the actual data, our results also point to a fruitful research direction of developing algorithms that combine LLMs with existing causal techniques.

URL: https://openreview.net/forum?id=mqoxLkX210

---

Title: A Curious Case of Remarkable Resilience to Gradient Attacks via Fully Convolutional and Differentiable Front End with a Skip Connection

Abstract: We experimented with front-end enhanced neural models where a frozen backbone classifier was prepended by a differentiable and fully convolutional model with a skip connection. By training such composite models using a small learning rate for one epoch (or less), we obtained models that retained the accuracy of the backbone classifier while being unusually resistant to gradient attacks including APGD and FAB-T attacks from the AutoAttack package.

We provided evidence that this was due to gradient masking: Although the gradient masking phenomenon is not new, the degree of masking was quite remarkable for fully differentiable models that did not have gradient-shattering components such as JPEG compression or components that are expected to cause diminishing gradients. The training recipe to produce such models was remarkably stable and reproducible as well: We applied it to three datasets (CIFAR10, CIFAR100, and ImageNet) and several types of models (including recently proposed vision Transformers) without a single failure case.

Although black box attacks such as the SQUARE attack and the zero-order PGD can be partially effective against gradient masking, these attacks are easily defeated by combining gradient-masking models into simple randomized ensembles. We estimate that these ensembles achieve near-SOTA AutoAttack accuracy on CIFAR10, CIFAR100, and ImageNet (while retaining virtually all the clean accuracy of the original classifiers) despite having virtually zero accuracy under adaptive attacks. Quite interestingly, adversarial training of the backbone classifier can further increase resistance of the front-end enhanced model to gradient attacks. On CIFAR10, the respective randomized ensemble achieved 90.8±2.5% (99% CI) accuracy under AutoAttack while having only 18.2±3.6% accuracy under the adaptive attack.

We do not aim to establish SOTA in adversarial robustness. Instead, our paper makes methodological contributions and further supports the thesis that adaptive attacks designed with the complete knowledge of model architecture are crucial in demonstrating model robustness and that even the so-called white-box gradient attacks can have limited applicability. Although gradient attacks can be complemented with black-box attacks such as the SQUARE attack or the zero-order PGD, black-box attacks can be weak against randomized ensembles, e.g., when ensemble models mask gradients.

Code and instructions to reproduce key results are available at https://anonymous.4open.science/r/curious_case_of_gradient_masking-2D3E.
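
The composite model described above can be sketched as follows (an illustration, not the authors' code); layer widths and the one-epoch, small-learning-rate training details are assumptions.

# Illustrative sketch: a fully convolutional, dimension-preserving front end with a
# skip connection, prepended to a frozen backbone classifier.
import torch
import torch.nn as nn

class FrontEnd(nn.Module):
    def __init__(self, channels=3, width=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)  # skip connection keeps the front end near identity

def compose(frontend, backbone):
    for p in backbone.parameters():  # the backbone stays frozen; only the front end trains
        p.requires_grad_(False)
    return nn.Sequential(frontend, backbone)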

URL: https://openreview.net/forum?id=kt7Am2wHlm

---

Title: Deep Generative Models for Offline Policy Learning: Tutorial, Survey, and Perspectives on Future Directions

Abstract: Deep generative models (DGMs) have demonstrated great success across various domains, particularly in generating texts, images, and videos using models trained from offline data. Similarly, data-driven decision-making and robotic control also necessitate learning a generator function from offline data to serve as the strategy or policy. In this case, applying deep generative models to offline policy learning exhibits great potential, and numerous studies have explored this direction. However, this field still lacks a comprehensive review, so the developments of its different branches have remained relatively independent. Thus, we provide the first systematic review of the applications of deep generative models for offline policy learning. In particular, we cover five mainstream deep generative models, including Variational Auto-Encoders, Generative Adversarial Networks, Normalizing Flows, Transformers, and Diffusion Models, and their applications in both offline reinforcement learning (offline RL) and imitation learning (IL). Offline RL and IL are two main branches of offline policy learning and are widely adopted techniques for sequential decision-making. Specifically, for each type of DGM-based offline policy learning, we distill its fundamental scheme, categorize related works based on the usage of the DGM, and sort out the development process of algorithms in that field. Subsequent to the main content, we provide in-depth discussions on deep generative models and offline policy learning as a summary, based on which we present our perspectives on future research directions. This work offers a hands-on reference for the research progress in deep generative models for offline policy learning and aims to inspire improved DGM-based offline RL or IL algorithms.

URL: https://openreview.net/forum?id=Mm2cMDl9r5

---

Title: Generalized Oversampling for Learning from Imbalanced datasets and Associated Theory: Application in Regression

Abstract: In supervised learning, it is quite common to be confronted with real imbalanced datasets. This situation leads to learning difficulties for standard algorithms. Research and solutions in imbalanced learning have mainly focused on classification tasks. Despite its importance, very few solutions exist for imbalanced regression. In this paper, we propose a data augmentation procedure, the GOLIATH algorithm, based on kernel density estimates and especially dedicated to the problem of imbalanced data. This general approach encompasses two large families of synthetic oversampling: those based on perturbations, such as Gaussian noise, and those based on interpolations, such as SMOTE. It also provides an explicit form of such machine learning algorithms, from which new synthetic data generators are deduced. We apply GOLIATH to imbalanced regression, combining such generator procedures with a new wild-bootstrap resampling technique for the target values. We evaluate the performance of the GOLIATH algorithm in imbalanced regression, comparing our approach with state-of-the-art techniques.
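
For intuition, a highly simplified perturbation-based generator in the spirit of the Gaussian-noise family covered by GOLIATH might look as follows; the actual algorithm relies on kernel density estimates and a wild-bootstrap step for the targets, both omitted here.

# Simplified sketch of perturbation-based oversampling for imbalanced regression.
# The bandwidth and the way rare observations are selected are assumptions.
import numpy as np

def gaussian_oversample(X, y, rare_mask, n_new, bandwidth=0.1, rng=None):
    """Generate synthetic (x, y) pairs around observations flagged as rare."""
    rng = rng or np.random.default_rng(0)
    X_rare, y_rare = X[rare_mask], y[rare_mask]
    idx = rng.integers(0, len(X_rare), size=n_new)
    X_new = X_rare[idx] + bandwidth * rng.standard_normal((n_new, X.shape[1]))
    y_new = y_rare[idx] + bandwidth * rng.standard_normal(n_new)  # perturb targets too
    return np.vstack([X, X_new]), np.concatenate([y, y_new])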

URL: https://openreview.net/forum?id=DLqPhQxgYu

---

Title: The Sparse Matrix-Based Random Projection: An Analysis of Matrix Sparsity for Classification

Abstract: In this paper, we study the sparse $\{0,\pm1\}$-matrix based random projection, which has been widely applied in classification to reduce the data dimension. For this problem, it is interesting to estimate the optimal sparsity of sparse matrices for classification, namely the minimum number of nonzero entries $\pm1$ that supports achieving the best classification performance. To achieve this, we analyze the impact of matrix sparsity on the $\ell_1$ distance between projected data points. By principal component analysis, a larger distance between projected data points should better capture the variation among the original data and thus yield better classification performance. Theoretically, the $\ell_1$ distance between projected data points is related not only to the sparsity of the sparse matrices, but also to the distribution of the original data. Without loss of generality, we evaluate two typical data distributions, the Gaussian mixture distribution and the two-point distribution, which have been widely used to model the distributions of real data. Given the two data distributions, it is proved that the maximum $\ell_1$ distance between projected data points can be approximately achieved when the sparse matrix contains only one, or at most about twenty, nonzero entries per row, under the size $m\geq\mathcal{O}(\sqrt{n})$. Accordingly, the best classification performance should also be achieved under such conditions. This is confirmed with extensive experiments on different types of data, including image, text, gene, and binary quantization data.
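
A minimal sketch of the projection studied (the per-row count k below is illustrative):

# Sparse {0, ±1} random projection with exactly k nonzero entries per row.
import numpy as np

def sparse_projection_matrix(m, n, k=1, rng=None):
    """m x n matrix with k nonzero entries (+1 or -1) per row, zeros elsewhere."""
    rng = rng or np.random.default_rng(0)
    R = np.zeros((m, n))
    for i in range(m):
        cols = rng.choice(n, size=k, replace=False)
        R[i, cols] = rng.choice([-1.0, 1.0], size=k)
    return R

# Usage: project data X (num_samples x n) down to m dimensions before classification,
# e.g. X_proj = X @ sparse_projection_matrix(m, X.shape[1]).T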

URL: https://openreview.net/forum?id=Z5j4ydrDRy

---

Title: Augment then Smooth: Reconciling Differential Privacy with Certified Robustness

Abstract: Machine learning models are susceptible to a variety of attacks that can erode trust, including attacks against the privacy of training data, and adversarial examples that jeopardize model accuracy. Differential privacy and certified robustness are effective frameworks for combating these two threats respectively, as they each provide future-proof guarantees. However, we show that standard differentially private model training is insufficient for providing strong certified robustness guarantees. Indeed, combining differential privacy and certified robustness in a single system is non-trivial, leading previous works to introduce complex training schemes that lack flexibility. In this work, we present DP-CERT, a simple and effective method that achieves both privacy and robustness guarantees simultaneously by integrating randomized smoothing into standard differentially private model training. Compared to the leading prior work, DP-CERT gives up to a 2.5$\times$ increase in certified accuracy for the same differential privacy guarantee on CIFAR10. Through in-depth per-sample metric analysis, we find that larger certifiable radii correlate with smaller local Lipschitz constants, and show that DP-CERT effectively reduces Lipschitz constants compared to other differentially private training methods.
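
A schematic sketch of the augment-then-smooth idea described above: noisy copies of each example are added during (differentially private) training, and predictions at inference use a randomized-smoothing majority vote. DP-SGD itself and the formal certification step are omitted, and the noise level and number of copies are assumptions, not the paper's settings.

# Schematic sketch only; not the DP-CERT implementation.
import torch

def augment_batch(x, y, sigma=0.25, copies=2):
    """Append Gaussian-noise-augmented copies so the model trains on smoothed inputs."""
    noisy = [x + sigma * torch.randn_like(x) for _ in range(copies)]
    return torch.cat([x] + noisy), torch.cat([y] * (copies + 1))

@torch.no_grad()
def smoothed_predict(model, x, sigma=0.25, n=100):
    """Majority vote over Gaussian perturbations of a single input x of shape [C, H, W]."""
    votes = model(x.unsqueeze(0) + sigma * torch.randn(n, *x.shape)).argmax(dim=1)
    return torch.mode(votes).values.item()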

URL: https://openreview.net/forum?id=YN0IcnXqsr

---

Title: Exploring Explicit Representations in 4D: A Comparative Analysis with HexPlane

Abstract: Modeling and re-rendering novel views of dynamic 3D scenes is a challenging problem in 3D vision. When employing implicit representations for the task, extending static NeRFs to 4D incurs high computational costs due to the numerous MLP evaluations, which highlights the need for efficient representations of dynamic 3D scenes. Non-NeRF methods such as Niemeyer et al. (2019), Jiang et al. (2022), and Jiang et al. (2021) have primarily been applied to idealized, single-subject scenes and have not yet been adapted for real-world camera images. Cao and Johnson (2023) propose using HexPlane, an explicit scene representation method that factors a 4D volume into six feature planes. This paper attempts to verify their claims and compare them with similar methods such as Gaussian Splatting by Wu et al. (2023) and K-Planes by Fridovich-Keil et al. (2023). We conduct a thorough examination of the architectural choices and design elements inherent in HexPlane and further incorporate additional regularization to achieve a performance improvement.

URL: https://openreview.net/forum?id=dgZXa7plmh

---

Title: Learning the essential in less than 2k additional weights - a simple approach to improve image classification stability under corruptions

Abstract: The performance of image classification on well-known benchmarks such as ImageNet is remarkable, but in safety-critical situations, the accuracy often drops significantly under adverse conditions. To counteract these performance drops, we propose a very simple modification to the models: we prepend a single, dimension-preserving convolutional layer with a large linear kernel whose purpose is to extract the information that is essential for image classification. We show that our simple modification can increase the robustness against common corruptions significantly, especially for corruptions of high severity. We demonstrate the impact of our channel-specific layers on ImageNet-100 and ImageNette classification tasks and show an increase of up to 30\% in top-1 accuracy on corrupted data. Further, we conduct a set of designed experiments to qualify the conditions for our findings. Our main result is that a data- and network-dependent linear subspace carries the most important classification information (the essential), which our proposed pre-processing layer approximately identifies for most corruptions, and at very low cost.
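
A minimal sketch of such a prepended layer, assuming a depthwise (channel-specific) convolution; the kernel size is an assumption, chosen only so that the added parameter count stays below 2k for RGB inputs.

# Illustrative sketch: a single dimension-preserving, channel-specific convolution
# with a large kernel prepended to an otherwise unchanged classifier.
import torch.nn as nn

def prepend_essential_layer(classifier, channels=3, kernel_size=21):
    # depthwise conv: 3 * 21 * 21 = 1323 weights, i.e. fewer than 2k additional parameters
    front = nn.Conv2d(channels, channels, kernel_size,
                      padding=kernel_size // 2, groups=channels, bias=False)
    return nn.Sequential(front, classifier)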

URL: https://openreview.net/forum?id=i2SuGWtIIm

---

Title: Systematic Exploration and Exploitation via a Markov Game with Impulse Control

Abstract: Efficient reinforcement learning (RL) involves a trade-off between ``exploitative'' actions that maximise expected reward and ``explorative'' actions that lead to the visitation of ``novel'' states. To encourage exploration, existing methods propose techniques such as injecting stochasticity into action selection, implicit regularisation, and synthetic heuristic rewards. However, these techniques do not necessarily offer a systematic approach for making this trade-off. Here we introduce the \textbf{SE}lective \textbf{R}einforcement \textbf{E}xploration \textbf{N}etwork (SEREN), a plug-and-play framework that casts the exploration-exploitation trade-off as a Markov game between an RL agent (the exploiter), which purely exploits task-dependent rewards, and another RL agent (the switcher), which chooses at which states to activate a \textit{pure exploration} policy that is trained to minimise system uncertainty and override the exploiter. Using a form of policies known as \textit{impulse control}, the switcher is able to determine the best set of states at which to switch to the exploration policy, while the exploiter is free to execute its actions everywhere else. We prove the convergence of SEREN in the linear regime and show that it induces a natural schedule towards pure exploitation. Through extensive empirical studies in both discrete and continuous control benchmarks, we show that, with minimal modification, SEREN can be readily combined with existing RL algorithms and yields performance improvements.

URL: https://openreview.net/forum?id=eguqkWJBVA

---

Title: Improving Variational Autoencoder Estimation from Incomplete Data with Mixture Variational Families

Abstract: We consider the task of estimating variational autoencoders (VAEs) when the training data is incomplete. We show that missing data increases the complexity of the model’s posterior distribution over the latent variables compared to the fully-observed case. The increased complexity may adversely affect the fit of the model due to a mismatch between the variational and model posterior distributions. We introduce two strategies based on (i) finite variational-mixture and (ii) imputation-based variational-mixture distributions to address the increased posterior complexity. Through a comprehensive evaluation of the proposed approaches, we show that variational mixtures are effective at improving the accuracy of VAE estimation from incomplete data.

URL: https://openreview.net/forum?id=lLVmIvZfry

---

Title: Homogenizing Non-IID Datasets via In-Distribution Knowledge Distillation for Decentralized Learning

Abstract: Decentralized learning enables serverless training of deep neural networks (DNNs) in a distributed manner on multiple nodes. One of the key challenges with decentralized learning is heterogeneity in the data distribution across the nodes. Data heterogeneity results in slow and unstable global convergence and therefore poor generalization performance. In this paper, we propose In-Distribution Knowledge Distillation (IDKD) to address the challenge of heterogeneous data distribution. The goal of IDKD is to homogenize the data distribution across the nodes. While such data homogenization can be achieved by exchanging data among the nodes sacrificing privacy, IDKD achieves the same objective using a common public dataset across nodes without breaking the privacy constraint. This public dataset is different from the training dataset and is used to distill the knowledge from each node and communicate it to its neighbors through the generated labels. With traditional knowledge distillation, the generalization of the distilled model is reduced due to misalignment between the private and public data distribution. Thus, we introduce an Out-of-Distribution (OoD) detector at each node to label a subset of the public dataset that maps close to the local training data distribution. Our experiments on multiple image classification datasets and graph topologies show that the proposed IDKD scheme is more effective than traditional knowledge distillation and achieves state-of-the-art generalization performance on heterogeneously distributed data with minimal communication overhead.

URL: https://openreview.net/forum?id=CuyJkNjIVd

---

Title: On the numerical reliability of nonsmooth autodiff: a MaxPool case study

Abstract: This paper considers the reliability of automatic differentiation for neural networks involving the nonsmooth MaxPool operation across various precision levels (16, 32, 64 bits), architectures (LeNet, VGG, ResNet), and datasets (MNIST, CIFAR10, SVHN, ImageNet). Although AD can be incorrect, recent research has shown that it coincides with the derivative almost everywhere, even in the presence of nonsmooth operations (such as MaxPool and ReLU). On the other hand, in practice, AD operates with floating-point numbers, and there is, therefore, a need to explore subsets on which AD can be {\em numerically} incorrect. These subsets include a bifurcation zone (where AD is incorrect over reals) and a compensation zone (where AD is incorrect over floating-point numbers but correct over reals). Using SGD for the training process, we study the impact of different choices of the nonsmooth Jacobian for the MaxPool function on the precision of 16 and 32 bits. These findings suggest that nonsmooth MaxPool Jacobians with lower norms help maintain stable and efficient test accuracy, whereas those with higher norms can result in instability and decreased performance. We also observe that the influence of MaxPool's nonsmooth Jacobians on learning can be reduced by using batch normalization, Adam-like optimizers, or increasing the precision level.
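
As a small, self-contained illustration of the phenomenon studied (not code from the paper), the snippet below shows that at a tied MaxPool input the autodiff "Jacobian" is an implementation choice: all of the gradient mass is assigned to one arbitrary argmax. The 16-bit case can be probed analogously on a GPU; the run below uses 32- and 64-bit precision.

# Tiny demonstration of the nonsmooth MaxPool Jacobian choice made by autodiff.
import torch
import torch.nn.functional as F

for dtype in (torch.float32, torch.float64):
    x = torch.full((1, 1, 2, 2), 1.0, dtype=dtype, requires_grad=True)  # all entries tied
    y = F.max_pool2d(x, kernel_size=2)
    y.sum().backward()
    # Gradient is concentrated on a single (implementation-chosen) argmax entry.
    print(dtype, x.grad.flatten().tolist())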

URL: https://openreview.net/forum?id=142xsInVfp

---

Title: Selective Pre-training for Private Fine-tuning

Abstract: Text prediction models, when used in applications like email clients or word processors, must protect user data privacy and adhere to model size constraints. These constraints are crucial to meet memory and inference time requirements, as well as to reduce inference costs. Building small, fast, and private domain-specific language models is a thriving area of research. In this work, we show that a careful pre-training on a subset of the public dataset that is guided by the private dataset is crucial to train small language models with differential privacy. On standard benchmarks, small models trained with our new framework achieve state-of-the-art performance. In addition to performance improvements, our results demonstrate that smaller models, through careful pre-training and private fine-tuning, can match the performance of much larger models that do not have access to private data. This underscores the potential of private learning for model compression and enhanced efficiency.

URL: https://openreview.net/forum?id=y3u8OpPHxz

---

Title: On Exact Solutions of the Inner Optimization Problem of Adversarial Robustness

Abstract: In this work, we propose a robust framework that employs adversarially robust training to safeguard ML models against perturbed testing data. Our contributions can be seen from both computational and statistical perspectives. Firstly, from a computational/optimization point of view, we derive the ready-to-use exact solution for several widely used loss functions with a variety of norm constraints on the adversarial perturbation, for various supervised and unsupervised ML problems, including regression, classification, two-layer neural networks, graphical models, and matrix completion. The solutions are either in closed form or given by an easily tractable optimization problem such as 1-D convex optimization, semidefinite programming, difference-of-convex programming, or a sorting-based algorithm. Secondly, from a statistical/generalization viewpoint, using some of these results, we derive novel bounds on the adversarial Rademacher complexity for various problems, which entail new generalization bounds. Thirdly, we validate our approach by showing significant performance improvements on real-world datasets over various gradient-ascent-based baselines for supervised problems such as regression and classification, as well as for unsupervised problems such as matrix completion and learning graphical models, with very little computational overhead.
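
As a concrete instance of such an exact inner solution (the simplest case, not the paper's general derivation): for a linear predictor $f(x) = w^\top x + b$, a label $y \in \{-1, +1\}$, an $\ell_\infty$-bounded perturbation, and any loss that is nonincreasing in the margin $y f(x)$, the loss-maximizing perturbation is $\delta^* = -\epsilon\, y\, \mathrm{sign}(w)$.

# Closed-form worst-case perturbation for a linear model under an l_inf constraint
# (a textbook special case, used here only to illustrate the idea of exact inner solutions).
import numpy as np

def worst_case_linear_linf(w, y, epsilon):
    return -epsilon * y * np.sign(w)

def robust_hinge(w, b, x, y, epsilon):
    """Exact value of max_{||delta||_inf <= eps} hinge(y, w.(x + delta) + b)."""
    delta = worst_case_linear_linf(w, y, epsilon)
    return max(0.0, 1.0 - y * (w @ (x + delta) + b))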

URL: https://openreview.net/forum?id=JlE1MMyy0h

---

Title: Improve Certified Training with Signal-to-Noise Ratio Loss to Decrease Neuron Variance and Increase Neuron Stability

Abstract: This work addresses the issue of over-regularization in certified training, which often results in lower certified robustness.
By introducing the concepts of neuron variance and neuron stability, we delve into their roles in inducing over-regularization and affecting the model's certified robustness. To tackle the problem, we extend the Signal-to-Noise Ratio (SNR) into the realm of model robustness, offering a novel perspective and developing SNR-inspired losses aimed at optimizing neuron variance and stability to mitigate over-regularization. Through both empirical and theoretical analyses, our SNR-based approach demonstrates superior performance over existing methods on the MNIST and CIFAR-10 datasets. Further, our exploration into adversarial training uncovers a beneficial correlation between neuron variance and adversarial robustness, leading to an optimized balance between standard and robust accuracy that outperforms the baseline method.

URL: https://openreview.net/forum?id=iV0jktFZ5Y

---

Title: Independence Testing for Temporal Data

Abstract: Temporal data are increasingly prevalent in modern data science. A fundamental question is whether two time series are related or not. Existing approaches often have limitations, such as relying on parametric assumptions, detecting only linear associations, and requiring multiple tests and corrections. While many non-parametric and universally consistent dependence measures have recently been proposed, directly applying them to temporal data can inflate the p-value and result in an invalid test. To address these challenges, this paper introduces the temporal dependence statistic with block permutation to test independence between temporal data. Under proper assumptions, the proposed procedure is asymptotically valid and universally consistent for testing independence between stationary time series, and capable of estimating the optimal dependence lag that maximizes the dependence. Moreover, it is compatible with a rich family of distance and kernel based dependence measures, eliminates the need for multiple testing, and exhibits excellent testing power in various simulation settings.
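
A minimal sketch of the block-permutation idea, assuming a simple correlation statistic as a stand-in for the distance- and kernel-based dependence measures the paper actually supports.

# Block-permutation p-value for independence between two stationary time series.
# Contiguous blocks are permuted so that local temporal dependence is preserved
# under the null; the correlation statistic is an illustrative placeholder.
import numpy as np

def block_permutation_pvalue(x, y, block_size=20, n_perm=1000, rng=None):
    rng = rng or np.random.default_rng(0)
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = (len(x) // block_size) * block_size
    x, y = x[:n], y[:n]
    stat = lambda a, b: abs(np.corrcoef(a, b)[0, 1])
    observed = stat(x, y)
    blocks = np.arange(n).reshape(-1, block_size)
    count = 0
    for _ in range(n_perm):
        perm = blocks[rng.permutation(len(blocks))].ravel()
        count += stat(x[perm], y) >= observed
    return (1 + count) / (1 + n_perm)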

URL: https://openreview.net/forum?id=jv1aPQINc4

---

Title: Large Language Models (LLMs) on Tabular Data: Prediction, Generation, and Understanding - A Survey

Abstract: Recent breakthroughs in large language modeling have facilitated rigorous exploration of their application in diverse tasks related to tabular data modeling, such as prediction, tabular data synthesis, question answering, and table understanding. Each task presents unique challenges and opportunities. However, there is currently a lack of comprehensive review that summarizes and compares the key techniques, metrics, datasets, models, and optimization approaches in this research domain. This survey aims to address this gap by consolidating recent progress in these areas, offering a thorough survey and taxonomy of the datasets, metrics, and methodologies utilized. It identifies strengths, limitations, unexplored territories, and gaps in the existing literature, while providing some insights for future research directions in this vital and rapidly evolving field. It also provides relevant code and datasets references. Through this comprehensive review, we hope to provide interested readers with pertinent references and insightful perspectives, empowering them with the necessary tools and knowledge to effectively navigate and address the prevailing challenges in the field.

URL: https://openreview.net/forum?id=IZnrCGF9WI

---

Title: SwinGNN: Rethinking Permutation Invariance in Diffusion Models for Graph Generation

Abstract: Permutation-invariant diffusion models of graphs achieve invariant sampling and invariant loss functions by restricting architecture designs, which often sacrifices empirical performance. In this work, we first show that the performance degradation may also be attributable to the increased number of modes in the target distributions brought about by invariant architectures, since 1) the optimal one-step denoising scores are score functions of Gaussian mixture models (GMMs) whose components center on these modes and 2) learning the scores of GMMs with more components is often harder. Motivated by this analysis, we propose SwinGNN along with a simple yet provable trick that enables permutation-invariant sampling. It benefits from more flexible (non-invariant) architecture designs and permutation-invariant sampling. We further design an efficient 2-WL message passing network using the shifted-window self-attention. Extensive experiments on synthetic and real-world protein and molecule datasets show that SwinGNN outperforms existing methods by a substantial margin on most metrics.

URL: https://openreview.net/forum?id=abfi5plvQ4

---

Title: Bayesian Optimization for minimizing CVaR under performance constraints

Abstract: Optimal portfolio allocation can often be formulated as a constrained risk problem, where one aims to minimize a risk measure subject to some performance constraints. This paper presents new Bayesian Optimization (BO) algorithms for such constrained minimization problems, seeking to minimize the conditional value-at-risk (CVaR), a computationally intensive risk measure, under a minimum expected return constraint. The proposed algorithms utilize a new acquisition function, which drives sampling towards the optimal region. Additionally, a new two-stage procedure is developed, which significantly reduces the number of evaluations of the expensive-to-evaluate objective function. The proposed algorithm's competitive performance is demonstrated through practical examples.

URL: https://openreview.net/forum?id=PELWWWqyyJ

---

Title: A Primal-Dual Approach to Bilevel Optimization with Multiple Inner Minima

Abstract: Bilevel optimization has found extensive applications in modern machine learning problems such as hyperparameter optimization, neural architecture search, meta-learning, etc. While bilevel problems with a unique inner minimal point (e.g., where the inner function is strongly convex) are well understood, such problems with multiple inner minimal points remain challenging and open. Existing algorithms designed for such a problem are applicable only to restricted situations and do not come with a full guarantee of convergence. In this paper, we adopt a reformulation of bilevel optimization as constrained optimization, and solve the problem via a primal-dual bilevel optimization (PDBO) algorithm. PDBO not only addresses the multiple inner minima challenge, but also features fully first-order efficiency without involving second-order Hessian and Jacobian computations, as opposed to most existing gradient-based bilevel algorithms. We further characterize the convergence rate of PDBO, which serves as the first known non-asymptotic convergence guarantee for bilevel optimization with multiple inner minima. Our experiments demonstrate the desired performance of the proposed approach.

URL: https://openreview.net/forum?id=QO9RbXSPl3

---

Title: Particle-based Online Bayesian Sampling

Abstract: Online learning has gained increasing interest due to its capability of tracking real-world streaming data. Although it has been widely studied in the setting of frequentist statistics, few works have considered online learning for the Bayesian sampling problem. In this paper, we study an Online Particle-based Variational Inference (OPVI) algorithm that updates a set of particles to gradually approximate the Bayesian posterior. To reduce the gradient error caused by the use of stochastic approximation, we include a sublinearly increasing batch-size method to reduce the variance. To track the performance of the OPVI algorithm with respect to a sequence of dynamically changing target posteriors, we provide the first theoretical analysis of the dynamic regret from the perspective of Wasserstein gradient flow. Experimental results on Bayesian Neural Networks show that the proposed algorithm achieves up to a 20\% improvement over naively applying existing Bayesian sampling methods in the online setting.

URL: https://openreview.net/forum?id=o7MrRBj8Qe

---

Title: Adversarial Attacks on Online Learning to Rank with Stochastic Click Models

Abstract: We propose the first study of adversarial attacks on online learning to rank. The goal of the attacker is to misguide the online learning to rank algorithm into placing the target item on top of the ranking list for a number of rounds that is linear in the time horizon $T$, with a sublinear attack cost. We propose generalized list poisoning attacks that perturb the ranking list presented to the user. This strategy can efficiently attack any no-regret ranker in general stochastic click models. Furthermore, we propose a click-poisoning-based strategy named attack-then-quit that can efficiently attack two representative OLTR algorithms for stochastic click models. We theoretically analyze the success and cost upper bounds of the two proposed methods. Experimental results based on synthetic and real-world data further validate the effectiveness and cost-efficiency of the proposed attack strategies.

URL: https://openreview.net/forum?id=BKwGowR0Bt

---

Title: Reward-based Autonomous Online Learning Framework for Resilient Cooperative Target Monitoring using a Swarm of Robots

Abstract: This paper addresses the problem of decentralized cooperative monitoring of an agile target using a swarm of robots undergoing dynamic sensor failures. Each robot is equipped with a proprioceptive sensor suite for the estimation of its own pose and an exteroceptive sensor suite for target detection and position estimation with a limited field of view. Further, the robots use broadcast-based communication modules with a limited communication radius and bandwidth. The uncertainty in the system and the environment can lead to intermittent communication link drops, target visual loss, and large biases in the sensors' estimation output due to temporary or permanent failures. Robotic swarms often operate without leaders, supervisors, or landmarks, i.e., without the availability of ground truth regarding pose information. In such scenarios, each robot is required to exhibit autonomous learning by taking charge of its own learning process while making the most of the available information. In this regard, a novel Autonomous Online Learning (AOL) framework is proposed, in which a decentralized online learning mechanism driven by reward-like signals is intertwined with an implicit adaptive consensus-based, two-layered, weighted information fusion process that utilizes the robots' observations and their shared information, thereby ensuring resilience in the robotic swarm. In order to study the effect of loss or reward design in the local and social learning layers, three AOL variants are presented. A novel perturbation-greedy reward design is introduced in the learning layers of two variants, leading to exploration-exploitation in their information fusion's weights' space. Convergence analysis of the weights is carried out, showing that the weights converge under reasonable assumptions. Simulation results show that the AOL variant using the perturbation-greedy reward in its local learning layer performs the best, doing $182.2\%$ to $652\%$ and $94.7\%$ to $150.4\%$ better than the baselines in terms of detection score and closeness score per robot, respectively, as the total number of robots is increased from $5$ to $30$. Further, AOL's Sim2Real implementation has been validated using a ROS-Gazebo setup.

URL: https://openreview.net/forum?id=PzmaWLqK0e

---

Title: Unmasking the Veil: An Investigation into Concept Ablation for Privacy and Copyright Protection in Images

Abstract: In this paper, we extend the study of concept ablation within pre-trained models as introduced in 'Ablating Concepts in Text-to-Image Diffusion Models' by Kumari et al. (2022). Our work focuses on reproducing the results achieved by the different variants of concept ablation proposed through predefined metrics. We also introduce a novel variant of concept ablation, trademark ablation. This variant combines the principles of memorization and instance ablation to tackle the nuanced influence of proprietary or branded elements in model outputs. Further, our research contributions include an observational analysis of the model's limitations. Moreover, we investigate the model's behavior in response to ablation leakage-inducing prompts, which aim to indirectly ablate concepts, revealing insights into the model's resilience and adaptability. We also observe the model's performance degradation on images generated by concepts far from its target ablation concept, which is documented in the appendix.

URL: https://openreview.net/forum?id=TYYApLzjaQ

---

Title: [Re] Reproducibility Study of Equal Improvability Fairness Notion

Abstract: Our research validates and expands the Equal Improvability (EI) framework, which aims to equalize acceptance rates across different groups by quantifying required improvement efforts, thereby enhancing long-term fairness. By replicating the original findings, we reaffirm EI's foundational claims. Additionally, extended experiments are conducted to probe EI's efficacy under varied scenarios. To enhance long-term fairness, we propose non-parametric updates and a Chi-squared fit to generalize the dataset, in contrast to the Gaussian-distributed dataset from the original study. Our analysis shows that the EI framework struggles to adapt to the Chi-squared fit and exhibits even poorer performance with non-parametric updates in long-term scenarios, indicating challenges in dynamic distribution settings.
The update rule is modified to align more closely with the theory and intuition. It is also shown that EI is more robust to noise than the other fairness notions. The examination of varying decision fractions uncovers the conditional robustness of EI across different acceptance rates.
These experiments highlight the strengths of EI in certain contexts and its limitations in others, providing a nuanced understanding of its applicability and areas for improvement in the pursuit of fairness in machine learning.

URL: https://openreview.net/forum?id=64GPyKTZJX

---

Title: [Re] $p$-Poisson surface reconstruction in curl-free flow from point clouds

Abstract: This study presents a reproducibility analysis of the $p$-Poisson surface reconstruction method presented by Park et al. (NeurIPS 2023). The method utilizes the $p$-Poisson equation and a curl-free constraint for improved surface reconstruction from point clouds, claiming significant advancements over existing implicit neural representation techniques. This study evaluates the reproducibility and generalizability of the results reported in the original paper, focusing on the evaluation using the Surface Reconstruction Benchmark (SRB) dataset. The neural network architecture and training procedures are entirely re-implemented from scratch, emphasizing correctness and efficient execution. While the replication generally outperforms the four alternative methods mentioned in the original paper, the distance results reported in the original paper fail to be reproduced by the re-implementation. Notably, training with the code published in the original paper yields similar results to the reproduced results, still deviating from the findings presented in the original paper. The presented implementation demonstrates a significant improvement in training performance, achieving a five-fold acceleration in training times compared to the code used in the original paper by vectorizing the gradient calculations and leveraging just-in-time compilation of the training loop, which gives an actionable insight for others to explore and integrate such optimizations into their machine learning code. The re-implementation is available at \footnote{\url{https://anonymous.4open.science/r/pinc-B7CD}}.

URL: https://openreview.net/forum?id=zbAUDVaNiX

---

Title: Reproducibility study of "FairLISA: Fair User Modeling with Limited Sensitive Attributes Information"

Abstract: This is a reproducibility study of the paper "FairLISA: Fair User Modeling with Limited Sensitive Attributes Information" by Zhang et al. (2023). It proposes a method for increasing fairness in user modeling tasks by filtering out sensitive information from user embeddings. In contrast to other fairness-aware methods, FairLISA is designed for filtering data with both known and unknown sensitive attributes. In this paper we explain the method from the paper, the claims about its effectiveness, and our process of attempting to recreate said claims. We test the reproducibility of the original claims, test the generalisability of the method, and provide our implementation of the FairLISA method so further research can be done. We conclude that none of the claims of the original paper are fully reproducible in a reasonable amount of time. Some of the claims could be partially reproduced, and we detail those results.

URL: https://openreview.net/forum?id=bz6oVIjRDI

---

Title: Evaluating In-Sample Softmax in Offline Reinforcement Learning: An Analysis Across Diverse Environments

Abstract: In this work, we consider the problem of learning action-values and corresponding policies from a fixed batch of data. The algorithms designed for this setting need to account for the fact that the action-coverage of the data distribution may be incomplete, that is, certain state-action transitions are not present in the dataset. The core issue faced by offline RL methods is insufficient action-coverage, which leads to overestimation or divergence during the bootstrapping update. We critically examine the In-Sample Softmax (INAC) algorithm for offline reinforcement learning (RL), which addresses the challenge of learning effective policies from pre-collected data, without further environmental interaction, using an in-sample softmax. Through extensive analysis and comparison with other in-sample algorithms such as In-Sample Actor-Critic (IAC) and Batch-Constrained Q-learning (BCQ), we investigate INAC's efficacy across various environments, including tabular, continuous, and discrete domains, as well as imbalanced datasets. We find that INAC, when benchmarked against state-of-the-art offline RL algorithms, demonstrates robustness to variations in data distribution and performs comparably, if not superiorly, in all scenarios. We provide a comprehensive evaluation of the capabilities and limitations of the In-Sample Softmax method within the broader context of offline reinforcement learning.

URL: https://openreview.net/forum?id=0HDQXKwRFU

---

Title: Exploring Exploration: A Comparative Analysis of Colored Noise Strategies in Reinforcement Learning

Abstract: Reinforcement learning algorithms in general, and off-policy agents navigating continuous control spaces in particular, often induce exploration through the addition of noise to their action selection process. Popular implementations predominantly use uncorrelated Gaussian (white) noise or temporally correlated Ornstein-Uhlenbeck (OU) noise, which is closely related to red noise. Recent works propose using pink noise, which is halfway between white and OU noise, as the default action noise type. They claim pink noise to be a better default than noise schedulers, which are algorithms that vary the level of temporal correlation as learning progresses. In this paper, we attempt to verify these claims and present an analysis of colored-noise exploration, comparing various strategies of noise integration. We further attempt to identify the effect of using spatially and temporally correlated noise to achieve exploration. The code and samples are provided in the supplementary material.
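
For readers unfamiliar with colored action noise, here is an illustrative way to generate noise with a $1/f^{\beta}$ power spectrum via spectral shaping ($\beta=0$ white, $\beta=1$ pink, $\beta=2$ red); the cited works may rely on a different but comparable construction.

# Illustrative generator of temporally correlated (colored) action noise.
import numpy as np

def colored_noise(beta, n_steps, action_dim, rng=None):
    rng = rng or np.random.default_rng(0)
    freqs = np.fft.rfftfreq(n_steps)
    scale = np.ones_like(freqs)
    scale[1:] = freqs[1:] ** (-beta / 2.0)        # amplitude ~ 1/f^(beta/2), i.e. power ~ 1/f^beta
    white = rng.standard_normal((action_dim, len(freqs))) \
            + 1j * rng.standard_normal((action_dim, len(freqs)))
    noise = np.fft.irfft(white * scale, n=n_steps, axis=-1)
    # Normalize each action dimension to unit variance; shape (n_steps, action_dim).
    return (noise / noise.std(axis=-1, keepdims=True)).T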

URL: https://openreview.net/forum?id=M5x14QVUOg

---

Title: Chain-of-Thought Unfaithfulness as Disguised Accuracy

Abstract: Understanding the extent to which Chain-of-Thought (CoT) generations align with a large language model’s (LLM) internal computations is critical for deciding whether to trust an LLM’s output. As a proxy for CoT faithfulness, Lanham et al. (2023) propose a metric that measures a model’s dependence on its CoT for producing an answer. Within a single family of proprietary models, they find that LLMs exhibit a scaling-then-inverse-scaling relationship between model size and their measure of faithfulness, and that a 13 billion parameter model exhibits increased faithfulness compared to models ranging from 810 million to 175 billion parameters in size. We evaluate whether these results generalize as a property of all LLMs. We replicate their experimental setup with three different families of models and, under specific conditions, successfully reproduce the scaling trends for CoT faithfulness they report. However, we discover that simply changing the order of answer choices in the prompt can reduce the metric by 73 percentage points. The faithfulness metric is also highly correlated ($R^2 = 0.91$) with accuracy, raising doubts about its validity as a construct for evaluating faithfulness.

URL: https://openreview.net/forum?id=ydcrP55u2e

---

Title: XAudit : A Learning-Theoretic Look at Auditing with Explanations

Abstract: Responsible use of machine learning requires models to be audited for undesirable properties. While a body of work has proposed using explanations for auditing, how to do so and why has remained relatively ill-understood. This work formalizes the role of explanations in auditing using inspirations from active learning and investigates if and how model explanations can help audits. As an instantiation of our framework, we propose explanation-based algorithms for auditing linear classifiers and decision trees for `feature sensitivity'. Our results illustrate that counterfactual explanations are extremely helpful for auditing, even in the worst case. While Anchor explanations and decision paths may not be as beneficial in the worst case, in the average case they do help significantly.

URL: https://openreview.net/forum?id=gPtjyzXskg

---

Title: Can AI-Generated Text be Reliably Detected?

Abstract: The rapid progress of Large Language Models (LLMs) has made them capable of performing astonishingly well on various tasks, including document completion and question answering. The unregulated use of these models, however, can potentially lead to malicious consequences such as plagiarism, generating fake news, spamming, etc. Therefore, reliable detection of AI-generated text can be critical to ensure the responsible use of LLMs. Recent works attempt to tackle this problem either using certain model signatures present in the generated text outputs or by applying watermarking techniques that imprint specific patterns onto them. In this paper, we show that these detectors are not reliable in practical scenarios. In particular, we develop a recursive paraphrasing attack to apply to AI-generated text, which can break a whole range of detectors, including the ones using watermarking schemes as well as neural network-based detectors, zero-shot classifiers, and retrieval-based detectors. Our experiments include passages around 300 tokens in length, showing the sensitivity of the detectors even in the case of relatively long passages. We also observe that our recursive paraphrasing only degrades text quality slightly, as measured via human studies and metrics such as perplexity scores and accuracy on text benchmarks. Additionally, we show that even LLMs protected by watermarking schemes can be vulnerable to spoofing attacks that aim to mislead detectors into classifying human-written text as AI-generated, potentially causing reputational damage to the developers. In particular, we show that an adversary can infer hidden AI text signatures of the LLM outputs without having white-box access to the detection method. Finally, we provide a theoretical connection between the AUROC of the best possible detector and the Total Variation distance between human and AI text distributions, which can be used to study the fundamental hardness of the reliable detection problem for advanced language models.
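
A schematic sketch of the recursive paraphrasing loop described above; both `paraphrase` and `detector_score` are hypothetical placeholders for any neural paraphraser and any detector, not functions from the paper.

# Schematic sketch of a recursive paraphrasing attack on an AI-text detector.
def recursive_paraphrase_attack(text, paraphrase, detector_score,
                                rounds=3, threshold=0.5):
    """Repeatedly paraphrase the text until the detector no longer flags it."""
    for _ in range(rounds):
        if detector_score(text) < threshold:   # detector returns P(text is AI-generated)
            break
        text = paraphrase(text)                # each round further dilutes detectable signatures
    return text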

URL: https://openreview.net/forum?id=1YYpg9tdVb

---

Title: Reproducibility Study on Adversarial Attacks Against Robust Transformer Trackers

Abstract: New transformer networks have been integrated into object tracking pipelines and have demonstrated strong performance on the latest benchmarks. This paper focuses on understanding how transformer trackers behave under adversarial attacks and how different attacks perform on tracking datasets as their parameters change. We conducted a series of experiments to evaluate the effectiveness of existing adversarial attacks on object trackers with transformer and non-transformer backbones. We experimented on 7 different trackers, including 3 that are transformer-based, and 4 which leverage other architectures. These trackers are tested against 4 recent attack methods to assess their performance and robustness on VOT2022ST, UAV123 and GOT10k datasets. Our empirical study focuses on evaluating adversarial robustness of object trackers based on bounding box versus binary mask predictions, and attack methods at different levels of perturbations. Interestingly, our study found that altering the perturbation level may not significantly affect the overall object tracking results after the attack. Similarly, the sparsity and imperceptibility of the attack perturbations may remain stable against perturbation level shifts. By applying a specific attack on all transformer trackers, we show that new transformer trackers having a stronger cross-attention modeling achieve a greater adversarial robustness on tracking datasets, such as VOT2022ST and GOT10k. Our results also indicate the necessity for new attack methods to effectively tackle the latest types of transformer trackers.

URL: https://openreview.net/forum?id=FEEKR0Vl9s

---

Title: Reproducibility Study of "CDUL: CLIP-Driven Unsupervised Learning for Multi-Label Image Classification"

Abstract: This report is a reproducibility study of the paper "CDUL: CLIP-Driven Unsupervised Learning for Multi-Label Image Classification" published at ICCV 2023. Our report makes the following contributions: (1) We provide a reproducible, well commented and open-sourced code implementation for the entire method specified in the original paper. (2) We try to verify the effectiveness of the novel aggregation strategy which uses the CLIP model to initialize the pseudo labels for the subsequent unsupervised multi-label image classification task. (3) We try to verify the effectiveness of the gradient-alignment training method specified in the original paper, which is used to update the network parameters and pseudo labels.

URL: https://openreview.net/forum?id=l41hEv7NT4

---

Title: Reproducibility study of “LICO: Explainable Models with Language-Image Consistency”

Abstract: The growing reproducibility crisis in machine learning has brought forward a need for careful examination of research findings. This paper investigates the claims made by Lei et al. (2023) regarding their proposed method, LICO, for enhancing post-hoc interpretability techniques and improving image classification performance. LICO leverages natural language supervision from a vision-language model to enrich feature representations and guide the learning process. We conduct a comprehensive reproducibility study, employing (Wide) ResNets and established interpretability methods like Grad-CAM and RISE. We were mostly unable to reproduce the authors' results. In particular, we did not find that LICO consistently led to improved classification performance or improvements in quantitative and qualitative measures of interpretability. Thus, our findings highlight the importance of rigorous evaluation and transparent reporting in interpretability research.

URL: https://openreview.net/forum?id=Mf1H8X5DVb

---

Title: Reproducibility Study of "Language-Image COnsistency"

Abstract: This report aims to verify the findings and expand upon the evaluation and training methods from the paper LICO: Explainable Models with Language-Image COnsistency. The main claims are that LICO (i) enhances interpretability by producing more explainable saliency maps in conjunction with a post-hoc explainability method and (ii) improves image classification performance without computational overhead during inference. We have reproduced the key experiments conducted by Lei et al.; however, the obtained results do not support the original claims. Additionally, we identify a vulnerability in the paper’s main evaluation method that favors non-robust models, and propose robust experimental setups for quantitative analysis. Furthermore, we undertake additional studies on LICO’s training methodology to enhance its interpretability. Our code is available at https://anonymous.4open.science/r/lico-reproduction-7FEB.

URL: https://openreview.net/forum?id=FvxTseSYRk

---

Title: Reproducibility Study of “Explaining Temporal Graph Models Through an Explorer-Navigator Framework"

Abstract: This paper seeks to reproduce and extend the results of the paper “Explaining Temporal Graph Models Through an Explorer-Navigator Framework” by Xia et al. (2023). The main contribution of the original authors is a novel explainer for temporal graph networks, the Temporal GNN Explainer (T-GNNExplainer), which finds a subset of preceding events that “explain” a prediction made by a temporal graph model. The explorer is tested on two temporal graph models that are trained on two real-world and two synthetic datasets. The explorer is evaluated using a newly proposed metric for explanatory graph models. The authors compare the performance of their explorer to three baseline explainer methods, either adapted from a GNN explainer or developed by the authors. The authors claim that T-GNNExplainer achieves superior performance compared to the baselines when evaluated with their proposed metric. This work reproduces the original experiments by using the code (with minor adjustments), model specifications, and hyperparameters provided by the original authors. To evaluate the robustness of these claims, the method was extended to one new dataset (MOOC). Results show that the T-GNNExplainer performs best on some, but not all, metrics as reported in the original findings. We conclude that the main lines of this paper hold up even though all results are less pronounced than claimed. Results show that the T-GNNExplainer does not perform similarly across different T-GNN models, precise dataset specifications are needed to obtain high performance, and there are simpler, less computationally costly explainer methods (like PBONE) that could offer competitive results.

URL: https://openreview.net/forum?id=9M2XqvH2SB

---

Title: On the Analysis and Reproduction of "Post-hoc Concept Bottleneck Models" with an Extension to the Audio Domain

Abstract: Although deep neural networks are powerful tools, they are still considered "black boxes". With the proliferation of AI models, the need for their interpretability has increased. One way to improve the interpretability of deep neural networks is to understand their decisions in terms of human-understandable concepts. Concept Bottleneck Models (CBMs) aim to achieve this goal by embedding representations of concepts into the model, providing explainability for the decisions that a network makes. However, CBMs have various limitations concerning training efficiency and task applicability. The authors of the paper Post-hoc Concept Bottleneck Models (PCBMs) provide a novel approach to creating CBMs in a more efficient and generalizable way. In this paper, we evaluate their claims, namely, that PCBMs can be trained using any pre-trained neural network and that PCBMs offer interpretability without sacrificing significant performance. To do so, we not only attempted to reproduce the original paper's results but also extended the approach to the audio domain. Our results show good alignment with the original paper, but further analysis revealed some problems PCBMs may have, namely, challenges in obtaining a suitable list of relevant human-understandable concepts for a given task, and potential misalignment between concept encoders and input feature encoders. The code for our paper can be found at https://anonymous.4open.science/r/-354E/

URL: https://openreview.net/forum?id=xgA3Cw48EX

---

Title: On the Reproducibility of: Improvement-Focused Causal Recourse

Abstract: This work aims to reproduce the main findings of “Improvement-Focused Causal Recourse (ICR)” (König et al., 2023) within the field of algorithmic recourse recommendations. The authors demonstrate that acceptance-focused recourse recommendation methods, like counterfactual explanations (CE), may suggest actions that revert the model’s verdict by gaming the predictor whenever possible. To tackle this, the authors introduce ICR, which focuses on improvement by optimizing for a new target variable in their causal model. It is also demonstrated that improvement guarantees consequently translate into acceptance guarantees. We can confirm the findings of the original paper. The contribution of the current study is a more extensive assessment of the robustness and generalizability of ICR. Various techniques were employed to test the algorithm’s performance under different architectural choices, such as different classifiers or optimization methods, data and model shifts, and a new dataset. Our findings suggest that ICR is more robust than CE and causal recourse (CR).

URL: https://openreview.net/forum?id=QEaGochDK7

---

Title: Reproducibility Study: Equal Improvability: A New Fairness Notion Considering the Long-Term Impact

Abstract: This reproducibility study aims to evaluate the robustness of Equal Improvability (EI) - an effort-based framework for ensuring long-term fairness. To this end, we seek to analyze the three proposed EI-ensuring regularization techniques, i.e. Covariance-based, KDE-based, and Loss-based EI. Our findings largely substantiate the initial assertions, demonstrating EI’s enhanced performance over Empirical Risk Minimization (ERM) techniques on various test datasets. Furthermore, while affirming the long-term effectiveness in fairness, the study also uncovers challenges in resilience to overfitting, particularly in highly complex models.
Building upon the original study, the experiments were extended to include a new dataset and multiple sensitive attributes. These additional tests further demonstrated the effectiveness of the EI approach, reinforcing its continued success. Our study highlights the importance of adaptable strategies in AI fairness, contributing to the ongoing discourse in this field of research.

URL: https://openreview.net/forum?id=Yj8fUQGXXL

---

Title: “Studying How to Efficiently and Effectively Guide Models with Explanations” - A Reproducibility Study

Abstract: Model guidance describes the approach of regularizing the explanations of a deep neural network model towards highlighting the correct features to ensure that the model is "right for the right reasons". Rao et al. (2023) conducted an in-depth evaluation of effective and efficient model guidance for object classification across various loss functions, attribution methods, models, and "guidance depths" to study the effectiveness of different methods. Our work aims to (1) reproduce the main results obtained by Rao et al. (2023), and (2) propose several extensions to their research. We conclude that the major part of the original work is reproducible, with certain minor exceptions, which we discuss in this paper. In our extended work, we point to an issue with the Energy Pointing Game (EPG) metric used for evaluation and propose an extension for increasing its robustness. In addition, we observe the EPG metric's predisposition towards favoring larger bounding boxes, a bias we address by incorporating a corrective penalty term into the original Energy loss function. Furthermore, we revisit the feasibility of using segmentation masks in light of the original study's finding that minimal annotated data can significantly boost model performance. Our findings suggest that the Energy loss inherently guides models to on-object features without requiring segmentation masks. Finally, we explore the role of contextual information in object detection and, contrary to the assumption that focusing solely on object-specific features suffices for accurate classification, our findings suggest the importance of contextual cues in certain scenarios. Code available at: https://anonymous.4open.science/r/model_guidance_repro_study.
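
For context on the EPG discussion above, the sketch below uses a commonly assumed definition of the Energy Pointing Game score (the fraction of positive attribution mass falling inside the object's bounding box) and illustrates why larger boxes trivially score higher; it is an illustration written for this digest, not the reproduction's code.

```python
# Minimal sketch (assumed definition) of the Energy Pointing Game score.
import numpy as np

def epg_score(attribution: np.ndarray, bbox_mask: np.ndarray) -> float:
    """attribution: (H, W) map; bbox_mask: (H, W) boolean mask of the bounding box."""
    energy = np.clip(attribution, 0, None)          # keep positive attribution only
    return float(energy[bbox_mask].sum() / (energy.sum() + 1e-12))

# A larger box trivially captures more energy, which is the bias the authors penalize.
attr = np.random.default_rng(0).random((224, 224))
small = np.zeros((224, 224), dtype=bool)
large = np.zeros((224, 224), dtype=bool)
small[90:130, 90:130] = True
large[30:200, 30:200] = True
print(epg_score(attr, small), epg_score(attr, large))
```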

URL: https://openreview.net/forum?id=9ZzASCVhDF

---

Title: [Re] CUDA: Curriculum of Data Augmentation for Long-Tailed Recognition

Abstract: In this reproducibility study, we present our results and experience while replicating the paper CUDA: Curriculum of Data Augmentation for Long-Tailed Recognition (Ahn et al., 2023). Traditional datasets used in image recognition, such as ImageNet, are often synthetically balanced, meaning each class has an equal number of samples. In practical scenarios, datasets frequently exhibit significant class imbalances, with certain classes having a disproportionately larger number of samples than others. This discrepancy poses a challenge for traditional image recognition models, as they tend to favor classes with larger sample sizes, leading to poor performance on minority classes. CUDA proposes a class-wise data augmentation technique which can be used on top of any existing model to improve accuracy for long-tailed recognition (LTR). We successfully replicated all of the results pertaining to the long-tailed CIFAR-100-LT dataset and extended our analysis to provide deeper insights into how CUDA efficiently tackles class imbalance. The code and the readings are available at https://anonymous.4open.science/r/CUDA_readings-DBEB.

URL: https://openreview.net/forum?id=Wm6d44I8St

---

Title: Reproducing Improvement-Focused Causal Recourse

Abstract: Reproducibility Summary


Scope of Reproducibility — In this work, we evaluate the reproducibility of the paper Improvement-Focused Causal Recourse (ICR) by König et al. (2023). Our goal is to reproduce the paper's four main claims: (1) Do CE, CR and ICR lead to improvement? (2) Do CE, CR and ICR lead to acceptance (by the pre- and post-recourse predictor)? (3) Do CE, CR and ICR lead to acceptance by other predictors with comparable test error? (4) How costly are CE, CR and ICR recommendations?

Methodology — The authors of the paper provide an implementation in PyTorch for their proposed techniques and experiments. We reuse and extend their code for our additional experiments. The computational cost for running the experiments mentioned in the paper is 110 GPU hours using an NVIDIA A100-SXM4-40GB MIG 3g.20gb accelerator. Additionally, we took 317 GPU hours to reproduce the results for our extended experiments.

Results — We reproduced the original paper's work through our experiments. We find that the main claims of the paper largely hold. We assess the robustness and generalizability of some of the claims through our additional experiments. In that case, we found that one claim is not reproducible for our own synthesized 4-var SCM and also found a bug in the code for the 5-var SCM. Experiments are conducted with and without this bug.

What was easy — The commands to run the different Structural Causal Models with different confidence and hyper-parameter settings are well documented. All the relevant plots used in the paper are generated by a single command. Also, the names of the causal variables used for the SCMs in the paper resemble those used in the source code, which made the code interpretable.

What was difficult — We could not run the experiment for 10 iterations as mentioned in the paper due to time and resource constraints. Additionally, the authors used different random seeds for each experiment, which was not documented anywhere. We faced minor integer typecasting errors in the code, which we fixed on our end.

Communication with original authors — We reached out to the authors once with our queries regarding the assumptions and contexts of some sub-claims in the paper. We received a prompt response that answered most of our questions.

URL: https://openreview.net/forum?id=qI6uqbSQ47

---

Title: [RE] A Reproducibility Study on Scene-Graph Generation from 3D Point Clouds: Hybrid Approach with CLIP, 2D Image Semantics, and 3D Geometry

Abstract: Reproducibility Summary

Scope of Reproducibility
This paper scrutinizes the reproducibility of VL-SAT and multimodal learning systems for 3D semantic scene graph prediction. Leveraging visual (ViT, CLIP) and linguistic semantics, our study replicates top-k accuracy results and explores models such as SGFN and SGGPoint. We assess the impact of the CLIP adapter and 2D image semantics, and conduct hyperparameter tuning. Additionally, the ablation study investigates node and edge collaboration and the influence of a multi-head self-attention network within the VL-SAT architecture, enhancing understanding of these critical components. Our code can be accessed at https://github.com/dnabanita7/CVPR2023-VLSAT-reproducibility/.


Methodology
We use the open-source code released by the authors to generate datasets, create point cloud data, and train and validate samples for VL-SAT. Our implementation covers 150 3D reconstructed indoor scenes from the original 1553, maintaining the 160 object classes and 26 predicate types as outlined in the paper. Additionally, we collaborate with the authors to integrate code for the SGFN and SGGPoint models into our existing codebase. Expanding upon the methodology, we meticulously implement the provided specifications, addressing any gaps to ensure a comprehensive pipeline supporting all experiments. Our experimentation uses computational resources provided by an NVIDIA GeForce RTX 3090 GPU, totalling 100 GPU hours for training. Moreover, we secure access to GPU compute resources through collaboration with the ML Collective team.

Results
Upon executing the authors' provided code, we encountered the necessity for substantial modifications and additions, including the incorporation of numerous files. Following these adjustments and the addition of essential segments, we conducted reproducibility tests, ablation studies, and hyperparameter tuning. Consequently, our results largely support the main claims of the paper within a significant subset of experiments. However, there are notable discrepancies in many of the actual values obtained compared to those reported. Hence, we conclude that while the paper's findings are largely replicable, achieving precise reproducibility of results requires additional efforts due to the extensive changes and additions required in the provided code.

What was easy
We found it easy to discern the primary assertions of the paper and the corresponding experimental evidence. Furthermore, the availability of the authors' open-source implementation facilitated ease in training the model, conducting ablation studies, and fine-tuning hyperparameters.

What was difficult
Configuring the datasets presented challenges primarily due to the absence of pinned dependencies, and the lack of code for generating 3D datasets resulted in delays in conducting experiments. Additionally, identifying the sources of discrepancies in our findings proved challenging, compounded by the inaccessibility of training curves and model weights or checkpoints. These limitations hindered our ability to precisely replicate the reported results and necessitated additional efforts in troubleshooting and refining our implementation.

Communication with original authors
At the initiation of our research endeavour, we diligently maintained ongoing communication with the authors through email channels which benefited us with their valuable insights and resources, thereby enhancing the depth and scope of our study. However, subsequent to the integration of code for the models under investigation, our attempts to engage in further correspondence with the authors were met with silence.

URL: https://openreview.net/forum?id=uqQGhTyTN7

---

Title: Reproducibility Study of "ITI-GEN: Inclusive Text-to-Image Generation"

Abstract: Text-to-image generative models often present issues regarding fairness with respect to certain sensitive attributes, such as gender or skin tone. This study aims to reproduce the results presented in "ITI-GEN: Inclusive Text-to-Image Generation" by Zhang et al. (2023), which introduces a model to improve inclusiveness in these kinds of models. We show that most of the claims made by the authors about ITI-GEN hold: it improves the diversity and quality of generated images, it is scalable to different domains, it has plug-and-play capabilities, and it is efficient from a computational point of view. However, ITI-GEN sometimes uses undesired attributes as proxy features and it is unable to disentangle some pairs of (correlated) attributes such as gender and baldness. In addition, when the number of considered attributes increases, the training time grows exponentially and ITI-GEN struggles to generate inclusive images for all elements in the joint distribution. To solve these issues, we propose using Hard Prompt Search with negative prompting, a method that does not require training and that handles negation better than vanilla Hard Prompt Search. Nonetheless, Hard Prompt Search (with or without negative prompting) cannot be used for continuous attributes that are hard to express in natural language, an area where ITI-GEN excels as it is guided by images during training. Finally, we propose combining ITI-GEN and Hard Prompt Search with negative prompting.

URL: https://openreview.net/forum?id=d3Vj360Wi2

---

Title: Reproducibility Study of "Improvement-Focused Causal Recourse (ICR)"

Abstract: This paper presents a reproducibility study of the "Improvement-Focused Causal Recourse (ICR)" model, a novel approach in the field of algorithmic recourse and fairness. The original work by König et al. (2023) introduces ICR as a method to ensure that interventions in predictive models not only achieve the desired outcome (acceptance) but also lead to genuine improvement in real-world situations. Our study aims to validate and replicate the key claims of the original paper by conducting experiments across four datasets, including fully synthetic and semi-synthetic data. We specifically focus on four main claims: (1) ICR's effectiveness in scenarios where gaming is lucrative, (2) ICR's ability to achieve acceptance rates comparable to traditional methods like counterfactual explanation (CE) and causal recourse (CR), (3) ICR's robustness to model re-fitting, and (4) the cost of interventions for all methods. Our findings largely corroborate the original claims, with ICR demonstrating superior performance in guiding towards actual improvements and maintaining stable acceptance rates despite model re-fitting, a notable advantage over CE and CR methods. While we observe minor numerical discrepancies in results, the overall trends align with the original study, reinforcing the efficacy of ICR in enhancing both the explainability and equity of automated decision systems. This reproducibility study not only confirms the original findings but also highlights the importance of robust and practical approaches in algorithmic recourse for real-world applications.

URL: https://openreview.net/forum?id=Lneu1n5k1g

---

Title: Reproducibility Study Of Learning Fair Graph Representations Via Automated Data Augmentations

Abstract: In this study, we undertake a reproducibility analysis of "Learning Fair Graph Representations Via Automated Data Augmentations" by Ling et al. (2022). We assess the validity of the original claims centered around node classification tasks and explore the performance of the Graphair framework in link prediction tasks. Our investigation reveals that while we can partially reproduce some of the original claims—likely impeded by unstable training and a code bug identified through collaboration with the original authors—we fully substantiate another claim. Additionally, we broaden the application of Graphair from node classification to link prediction across various datasets. This expansion demonstrates Graphair’s superior performance in fairness metrics when compared to existing models, showing only a slight reduction in accuracy. This underlines Graphair’s potential applicability in a wider array of graph-based learning contexts, showcasing its capability to maintain high fairness standards without significantly compromising accuracy. Our code base can be found on GitHub https://anonymous.4open.science/r/Reproducibility-Study-Of-Graphair-1DB6.

URL: https://openreview.net/forum?id=4WiqHopXQX

---

Title: Re: Data Poisoning Attacks Against Multimodal Encoders

Abstract: Multimodal models, which leverage both visual and linguistic modalities, have gained increasing attention in recent years. However, these models are often trained on large-scale unlabeled datasets, which expose them to the risk of data poisoning attacks. An adversary can manipulate the training data to induce malicious behaviors in the model under certain conditions. Yang et al. (2023) recently conducted a study on the susceptibility of multimodal models to poisoning attacks. They introduced three types of poisoning attacks targeted at multimodal models, along with two potential defenses. In this work, we replicate all three attack strategies. However, we observed that the effectiveness of the attack depends on the poisoning rate in relation to the quantity of samples in the targeted class, a factor that can potentially reduce the efficiency of the attack. Additionally, we replicated the ablation study, verified the consistency of their claims, and provided further experimentation to test them. Regarding the proposed defenses, we reproduced them and explained a flaw in the first defense. Furthermore, we propose a more practical setting for the second defense.

URL: https://openreview.net/forum?id=5GVSTT3DNr

---

Title: Reproducibility study of "Robust Fair Clustering: A Novel Fairness Attack and Defense Framework"

Abstract: This reproducibility study examines "Robust Fair Clustering: A Novel Fairness Attack and Defense Framework" by Chhabra et al. (2023), an innovative work in fair clustering algorithms. Our study focuses on validating the original paper's claims concerning the susceptibility of state-of-the-art fair clustering models to adversarial attacks and the efficacy of the proposed Consensus Fair Clustering (CFC) defence mechanism. We employ a similar experimental framework but extend our investigations by using additional datasets. Our findings confirm the original paper's claims, reinforcing the vulnerability of fair clustering models to adversarial attacks and the robustness of the CFC mechanism.

URL: https://openreview.net/forum?id=Xu1sEPhjqH

---

Title: Reproducibility study of "Fair attribute completion on graph with missing attributes"

Abstract: Tackling unfairness is a challenging task with extensive difficulties in the context of graph learning models. One of the major issues is posed by the absence of node attributes, due to missing data or privacy concerns. A recent work by Guo et al. (2023) titled "Fair attribute completion on a graph with missing attributes" tackles this problem by introducing FairAC. The framework's main components adopt state-of-the-art approaches, including a sensitive discriminator and an attention mechanism, to provide a solution to both the unfairness and attribute completion problem. Supported by an experimental analysis, FairAC claims to exhibit superior fairness performance while achieving similar node classification performance compared to other baseline methods. In our work, we try to reproduce the results provided by the authors along with validating their main claims. On top of that, this analysis highlights FairAC's ability to handle graphs with varying sparsity and fill missing attributes, even in cases of limited neighbouring data.

URL: https://openreview.net/forum?id=bhgWubrkc9

---

Title: Solving the Tree Containment Problem Using Graph Neural Networks

Abstract: \textsc{Tree containment} is a fundamental problem in phylogenetics useful for verifying a proposed phylogenetic network, representing the evolutionary history of certain species. \textsc{Tree containment} asks whether the given phylogenetic tree (for instance, constructed from a DNA fragment showing tree-like evolution) is contained in the given phylogenetic network. In the general case, this is an NP-complete problem. We propose to solve it approximately using Graph Neural Networks. In particular, we propose to combine the given network and the tree and apply a Graph Neural Network to this network-tree graph. This way, we achieve the capability of solving the tree containment instances representing a larger number of species than the instances contained in the training dataset (i.e., our algorithm has the inductive learning ability). Our algorithm demonstrates an accuracy of over $95\%$ in solving the tree containment problem on instances with up to 100 leaves.

URL: https://openreview.net/forum?id=nK5MazeIpn

---

Title: [Re] Classwise-Shapley values for data valuation

Abstract: We evaluate CS-Shapley, a data valuation method introduced in Schoch et al. (2022) for classification problems. We repeat the experiments in the paper, including two additional methods, the Least Core (Yan & Procaccia, 2021) and Data Banzhaf (Wang & Jia, 2023), a comparison not found in the literature. We include more conservative error estimates and additional metrics, like rank stability, and a variance-corrected version of Weighted Accuracy Drop, originally introduced in Schoch et al. (2022). We conclude that while CS-Shapley helps in the scenarios it was originally tested in, in particular for the detection of corrupted labels, it is outperformed by the conceptually simpler Data Banzhaf in the task of detecting highly influential points.
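
As background for the Data Banzhaf comparison mentioned above, here is a generic Monte Carlo sketch of Banzhaf data values, with a hypothetical `utility` function standing in for validation performance; it is not the code used in this reproduction.

```python
# Minimal sketch: each point's Banzhaf value is its average marginal contribution
# to a utility function over uniformly random subsets of the other points.
import numpy as np

def banzhaf_values(n_points: int, utility, n_samples: int = 200, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    values = np.zeros(n_points)
    for _ in range(n_samples):
        subset = rng.random(n_points) < 0.5                  # each point included w.p. 1/2
        for i in range(n_points):
            with_i, without_i = subset.copy(), subset.copy()
            with_i[i], without_i[i] = True, False
            values[i] += utility(with_i) - utility(without_i)
    return values / n_samples

# Toy utility: fraction of "good" points in the subset (stand-in for validation accuracy).
good = np.array([1, 1, 0, 1, 0], dtype=float)
print(banzhaf_values(5, lambda mask: good[mask].mean() if mask.any() else 0.0))
```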

URL: https://openreview.net/forum?id=srFEYJkqD7

---

Title: Reproducibility Study of "Learning Perturbations to Explain Time Series Predictions"

Abstract: In this work, we attempt to reproduce the results of "Learning Perturbations to Explain Time Series Predictions", which introduced ExtremalMask, a mask-based perturbation method for explaining time series data. We investigated the key claims of that paper, namely that (1) the model outperformed other models in several key metrics on both synthetic and real data, and (2) the model performed better when using the loss function of the preservation game relative to that of the deletion game. Although discrepancies exist, our results generally support the core of the original paper's conclusions. Next, we interpret ExtremalMask's outputs using new visualizations and metrics and discuss the insights each interpretation provides. Finally, we test whether ExtremalMask creates out-of-distribution samples, and find that the model does not exhibit this flaw on our tested synthetic dataset. Overall, our results support and add nuance to the original paper's findings.

URL: https://openreview.net/forum?id=fCNqD2IuoD

---

Title: On the Reproducibility of: "Learning Perturbations to Explain Time Series Predictions"

Abstract: Deep Learning models have taken the front stage in the AI community, yet explainability challenges hinder their widespread adoption. Time series models, in particular, lack attention in this regard. This study tries to reproduce and extend the work of Enguehard (2023b), focusing on time series explainability by incorporating learnable masks and perturbations. Enguehard (2023b) employed two methods to learn these masks and perturbations, the preservation game (yielding SOTA results) and the deletion game (with poor performance). We extend the work by revising the deletion game’s loss function, testing the robustness of the proposed method on a novel weather dataset, and visualizing the learned masks and perturbations. Despite notable discrepancies in results across many experiments, our findings demonstrate that the proposed method consistently outperforms all baselines and exhibits robust performance across datasets. However, visualizations for the preservation game reveal that the learned perturbations primarily resemble a constant zero signal, questioning the importance of learning perturbations. Nevertheless, our revised deletion game shows promise, recovering meaningful perturbations and, in certain instances, surpassing the performance of the preservation game.

URL: https://openreview.net/forum?id=nPZgtpfgIx

---

Title: Directional Convergence Near Small Initializations and Saddles in Two-Homogeneous Neural Networks

Abstract: This paper examines gradient flow dynamics of two-homogeneous neural networks for small initializations, where all weights are initialized near the origin. For both square and logistic losses, it is shown that for sufficiently small initializations, the gradient flow dynamics spend sufficient time in the neighborhood of the origin to allow the weights of the neural network to approximately converge in direction to the Karush-Kuhn-Tucker (KKT) points of a neural correlation function that quantifies the correlation between the output of the neural network and corresponding labels in the training data set. For square loss, it has been observed that neural networks undergo saddle-to-saddle dynamics when initialized close to the origin. Motivated by this, this paper also shows a similar directional convergence among weights of small magnitude in the neighborhood of certain saddle points.

URL: https://openreview.net/forum?id=hfrPag75Y0

---

Title: Reproducibility study of FairAC

Abstract: This work aims to reproduce the findings of the paper "Fair Attribute Completion on Graph with Missing Attributes" written by Guo et al. (2023) by investigating the claims made in the paper. This paper suggests that the results of the original paper are reproducible and thus, the claims hold. However, the claim that FairAC is a generic framework for many downstream tasks is very broad and could therefore only be partially tested. Moreover, we show that FairAC is generalizable to various datasets and sensitive attributes and show evidence that the improvement in group fairness of the FairAC framework does not come at the expense of individual fairness. Lastly, the codebase of FairAC has been refactored and is now easily applicable for various datasets and models.

URL: https://openreview.net/forum?id=ccDi5jtSF7

---

Title: A Replication Study of Transfer Learning with Informative Priors: Simple Baselines Better than Previously Reported

Abstract: We pursue transfer learning to improve classifier accuracy on a target task with few labeled examples available for training. Recent work suggests that using a source task to learn a prior distribution over neural net weights, not just an initialization, can boost target task performance. We perform a replication study with careful hyperparameter tuning of all methods on every dataset. We find that standard transfer learning informed only by an initialization performs far better than reported in previous comparisons. The relative gains of methods using informative priors over standard transfer learning vary in magnitude across 5 total datasets. For the scenario of 5-300 examples per class, we find negative or negligible gains on 2 datasets, modest gains (between 1.5-3 points of accuracy) on 2 other datasets, and substantial gains (>8 points) on one dataset. Among methods using informative priors, we find that an isotropic covariance appears competitive with a learned low-rank covariance matrix while being substantially simpler to understand and tune. Further analysis suggests that the mechanistic justification for informed priors -- hypothesized improved alignment between train and test loss landscapes -- is not consistently supported due to high variability in empirical landscapes. We release code to allow independent reproduction of all experiments.

URL: https://openreview.net/forum?id=BbvSU02jLg

---

Title: [Re] Learning Fair Graph Representations via Automated Data Augmentations

Abstract: We evaluate the reproducibility of the paper "Learning Fair Graph Representations via Automated Data Augmentations" by Ling et al. (2023). Our objective is to reproduce the three major claims that (1) fair augmentations improve fairness while retaining similar accuracy compared to other fairness methods, (2) augmenting both edges and node features performs better than augmenting only one of the two, and (3) learned augmentations reduce node-wise sensitive homophily and correlation between node features and the sensitive attribute. The authors provide an implementation of their method in PyTorch. We use and extend the given code, implementing an additional multi-run evaluation protocol with different random seeds. We further create additional baselines by disabling fairness in the model and investigating the generalizability of the method to other graph neural network (GNN) architectures and graphs with varying homophily. We partially reproduce claims (1), (2), and (3), attaining similar performance for two out of the three datasets originally used, as well as noisy results for the third dataset. Additionally, in our work, the correlation between node features and the sensitive attribute does not drop as significantly as in the original paper. On the other hand, we find that the method generalizes to other GNN structures yet does not generalize to graphs with varying homophily, failing for unbalanced homophily settings. Overall, the outcomes of the experiments indicate a lack of stability in the Graphair framework.

URL: https://openreview.net/forum?id=9OlU865dAF

---

Title: Parameter-efficient Multi-Task and Multi-Domain Learning using Factorized Tensor Networks

Abstract: Multi-task and multi-domain learning methods seek to learn multiple tasks/domains, jointly or one after another, using a single unified network. The primary challenge and opportunity lie in leveraging shared information across these tasks and domains to enhance the efficiency of the unified network. The efficiency can be in terms of accuracy, storage cost, computation, or sample complexity. In this paper, we introduce a factorized tensor network (FTN) designed to achieve accuracy comparable to that of independent single-task or single-domain networks, while introducing a minimal number of additional parameters. The FTN approach entails incorporating task- or domain-specific low-rank tensor factors into a shared frozen network derived from a source model. This strategy allows for adaptation to numerous target domains and tasks without encountering catastrophic forgetting. Furthermore, FTN requires a significantly smaller number of task-specific parameters compared to existing methods. We performed experiments on widely used multi-domain and multi-task datasets. We show the experiments on convolutional-based architecture with different backbones and on transformer-based architecture. Our findings indicate that FTN attains similar accuracy as single-task or single-domain methods while using only a fraction of additional parameters per task.

URL: https://openreview.net/forum?id=y3rXu3kPLQ

---

Title: A Large-Scale 3D Face Mesh Video Dataset via Neural Re-parameterized Optimization

Abstract: We propose NeuFace, a 3D face mesh pseudo annotation method on videos via neural re-parameterized optimization. Despite the huge progress in 3D face reconstruction methods, generating reliable 3D face labels for in-the-wild dynamic videos remains challenging. Using NeuFace optimization, we annotate the per-view/-frame accurate and consistent face meshes on large-scale face videos, called the NeuFace-dataset. We investigate how neural re-parameterization helps to reconstruct image-aligned facial details on 3D meshes via gradient analysis. By exploiting the naturalness and diversity of 3D faces in our dataset, we demonstrate the usefulness of our dataset for 3D face-related tasks: improving the reconstruction accuracy of an existing 3D face reconstruction model and learning 3D facial motion prior. Code and datasets will be publicly available if accepted.

URL: https://openreview.net/forum?id=zVDMh6JvWc

---

Title: 'Explaining RL Decisions with Trajectories': A Reproducibility Study

Abstract: This work investigates the reproducibility of the paper "Explaining RL decisions with trajectories" by Deshmukh et al. (2023). The original paper introduces a novel approach in explainable reinforcement learning based on attributing the decisions of an agent to specific clusters of trajectories encountered during training. We verify the main claims from the paper, which state that (i) training on fewer trajectories induces a lower initial state value, (ii) trajectories in a cluster present similar high-level patterns, (iii) distant trajectories influence the decision of an agent, and (iv) humans correctly identify the trajectories attributed to the decision of the agent. We recover the environments used by the authors based on the partial original code they provided for one of the environments (Grid-World), and implement the remaining ones (Seaquest, HalfCheetah, Breakout, Q*Bert) from scratch. While we confirm that (i), (ii), and (iii) partially hold, we extend the largely qualitative experiments of the authors by introducing a quantitative metric to further support (iii), and new experiments and visual results for (i). Moreover, we investigate the use of different clustering algorithms and encoder architectures to further support (ii). We could not support (iv), given the limited extent of the original experiments. We conclude that, while some of the claims can be supported, further investigations and experiments could be of interest. We recognize the novelty of the work from the authors and hope that our work paves the way for clearer and more transparent approaches.

URL: https://openreview.net/forum?id=QdeBbK5CSh

---

Title: XPL: A Cross-Model framework for Semi-Supervised Prompt Learning in Vision-Language Models

Abstract: Prompt learning, which focuses on learning soft prompts, has emerged as a promising approach for efficiently adapting pretrained vision-language models (VLMs) to multiple downstream tasks. While prior works have shown promising performance on common benchmarks, they typically rely on labeled data samples only. This largely disregards the information available in the vast collection of otherwise unlabeled samples in the wild. To mitigate this, we propose a simple yet efficient cross-model framework that leverages unlabeled samples, achieving significant gains in model performance. Specifically, we employ a semi-supervised prompt learning approach which makes the learned prompts invariant to the different views of a given unlabeled sample. The multiple views are obtained using different augmentations on the images as well as by varying the lengths of the visual and text prompts attached to these samples. Experimenting with this simple yet surprisingly effective approach over a large number of benchmark datasets, we observe a considerable improvement in the quality of soft prompts, thereby yielding substantial gains in image classification performance. Interestingly, our approach also benefits from out-of-domain unlabeled images, highlighting its robustness and generalization capabilities. Our code will be made publicly available.

URL: https://openreview.net/forum?id=oxAZv3QD6M

---

Title: Mini-Batch Optimization of Contrastive Loss

Abstract: Contrastive learning has gained significant attention as a pre-training method for self-supervised learning due to its ability to leverage large amounts of unlabeled data. A contrastive loss function ensures that embeddings of positive sample pairs (e.g., from the same class or different views of the same data) are similar, while embeddings of negative pairs are dissimilar. However, practical constraints such as large memory requirements make it infeasible to consider all possible positive and negative pairs, leading to the use of mini-batches. In this paper, we investigate the theoretical aspects of mini-batch optimization in contrastive learning with the InfoNCE loss. We show that mini-batch optimization is equivalent to full-batch optimization if and only if all $\binom{N}{B}$ mini-batches are selected, while sub-optimality may arise when examining only a subset. We then demonstrate that utilizing high-loss mini-batches can speed up SGD convergence and propose a spectral clustering-based approach for identifying these high-loss mini-batches. Our experimental results validate our theoretical findings and demonstrate that our proposed algorithm outperforms vanilla SGD, providing a better understanding of mini-batch optimization in contrastive learning.
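
For readers unfamiliar with the InfoNCE objective discussed above, the following is a minimal sketch (written for this digest, not taken from the paper) of the per-mini-batch loss; the `info_nce` helper and the toy tensors are illustrative assumptions.

```python
# Minimal sketch: InfoNCE loss on a single mini-batch, assuming z1[i] and z2[i]
# are embeddings of two views of the same example (positives on the diagonal).
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """z1, z2: (B, d) L2-normalized embeddings of two views of the same B examples."""
    logits = z1 @ z2.t() / temperature           # (B, B) pairwise similarities
    targets = torch.arange(z1.size(0))           # positives lie on the diagonal
    return F.cross_entropy(logits, targets)      # the other B-1 columns act as negatives

# Full-batch optimization would need all C(N, B) mini-batches; in practice one samples
# (or, as the paper proposes, selects high-loss) mini-batches instead.
z1 = F.normalize(torch.randn(32, 128), dim=1)
z2 = F.normalize(torch.randn(32, 128), dim=1)
print(info_nce(z1, z2))
```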

URL: https://openreview.net/forum?id=Nux7OVXpJ9

---

Title: Transformer-Based Models Are Not Yet Perfect At Learning to Emulate Structural Recursion

Abstract: This paper investigates the ability of transformer-based models to learn structural recursion from examples. Recursion is a universal concept in both natural and formal languages. Structural recursion is central to the programming language and formal mathematics tasks where symbolic tools currently excel beyond neural models, such as inferring semantic relations between datatypes and emulating program behavior.
We introduce a general framework that nicely connects the abstract concepts of structural recursion in the programming language domain to concrete sequence modeling problems and learned models' behavior. The framework includes a representation that captures the general \textit{syntax} of structural recursion, coupled with two different frameworks for understanding their \textit{semantics}---one that is more natural from a programming languages perspective and one that helps bridge that perspective
with a mechanistic understanding of the underlying transformer architecture.

With our framework as a powerful conceptual tool, we identify different issues under various set-ups. The models trained to emulate recursive computations do not fully capture the recursion but instead fit shortcut algorithms, and thus cannot solve certain edge cases that are under-represented in the training distribution. In addition, it is difficult for state-of-the-art large language models (LLMs) to mine recursive rules from in-context demonstrations. Meanwhile, these LLMs fail in interesting ways when emulating the reduction (step-wise computation) of the recursive function.

URL: https://openreview.net/forum?id=Ry5CXXm1sf

---

Title: CompoDiff: Versatile Composed Image Retrieval With Latent Diffusion

Abstract: This paper proposes a novel diffusion-based model, CompoDiff, for solving zero-shot Composed Image Retrieval (ZS-CIR) with latent diffusion. This paper also introduces a new synthetic dataset, named SynthTriplets18M, with 18.8 million reference images, conditions, and corresponding target image triplets to train CIR models. CompoDiff and SynthTriplets18M tackle the shortages of the previous CIR approaches, such as poor generalizability due to the small dataset scale and the limited types of conditions. CompoDiff not only achieves a new state-of-the-art on four ZS-CIR benchmarks, including FashionIQ, CIRR, CIRCO, and GeneCIS, but also enables a more versatile and controllable CIR by accepting various conditions, such as negative text, and image mask conditions. CompoDiff also shows the controllability of the condition strength between text and image queries and the trade-off between inference speed and performance, which are unavailable with existing CIR methods. The code and dataset samples are available at Supplementary Materials.

URL: https://openreview.net/forum?id=mKtlzW0bWc

---

Title: Hyperbolic Random Forests

Abstract: Hyperbolic space is becoming a popular choice for representing data due to the hierarchical structure - whether implicit or explicit - of many real-world datasets. Along with it comes a need for algorithms capable of solving fundamental tasks, such as classification, in hyperbolic space.
Recently, multiple papers have investigated hyperbolic alternatives to hyperplane-based classifiers, such as logistic regression and SVMs. While effective, these approaches struggle with more complex hierarchical data. We, therefore, propose to generalize the well-known random forests to hyperbolic space.
We do this by redefining the notion of a split using horospheres. Since finding the globally optimal split is computationally intractable, we find candidate horospheres through a large-margin classifier. To make hyperbolic random forests work on multi-class data and imbalanced experiments, we furthermore outline new methods for combining classes based on the lowest common ancestor and class-balanced large-margin losses. Experiments on standard and new benchmarks show that our approach outperforms both conventional random forest algorithms and recent hyperbolic classifiers.

URL: https://openreview.net/forum?id=pjKcIzvXWR

---

Title: Improving Black-box Robustness with In-Context Rewriting

Abstract: Machine learning models often excel on in-distribution (ID) data but struggle with unseen out-of-distribution (OOD) inputs. Most techniques for improving OOD robustness are not applicable to settings where the model is effectively a black box, such as when the weights are frozen, retraining is costly, or the model is leveraged via an API. Test-time augmentation (TTA) is a simple post-hoc technique for improving robustness that sidesteps black-box constraints by aggregating predictions across multiple augmentations of the test input. TTA has seen limited use in NLP due to the challenge of generating effective natural language augmentations. In this work, we propose LLM-TTA, which uses LLM-generated augmentations as TTA's augmentation function. LLM-TTA outperforms conventional augmentation functions across sentiment, toxicity, and news classification tasks for BERT and T5 models, with BERT's OOD robustness improving by an average of 4.30 percentage points without regressing average ID performance. We explore selectively augmenting inputs based on prediction entropy to reduce the rate of expensive LLM augmentations, allowing us to maintain performance gains while reducing the average number of generated augmentations by 57.76%. LLM-TTA is agnostic to the task model architecture, does not require OOD labels, and is effective across low and high-resource settings.
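
To make the entropy-gated augmentation idea concrete, here is a minimal sketch; `classify` and `llm_rewrite` are hypothetical stand-ins (any task model and any LLM paraphraser), not the paper's API.

```python
# Minimal sketch: augment an input only when the task model is uncertain,
# then average predictions over the original input and its rewrites.
import numpy as np

def entropy(p: np.ndarray) -> float:
    return float(-(p * np.log(p + 1e-12)).sum())

def predict_with_tta(x: str, classify, llm_rewrite, threshold: float = 0.5, n_aug: int = 4):
    p = classify(x)
    if entropy(p) < threshold:           # confident prediction: skip expensive LLM calls
        return p
    rewrites = [llm_rewrite(x) for _ in range(n_aug)]
    return np.mean([classify(t) for t in [x] + rewrites], axis=0)

# Toy usage with dummy components.
rng = np.random.default_rng(0)
dummy_classify = lambda text: (lambda v: v / v.sum())(rng.random(3))
dummy_rewrite = lambda text: text + " (paraphrased)"
print(predict_with_tta("the movie was fine", dummy_classify, dummy_rewrite))
```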

URL: https://openreview.net/forum?id=e92dgUUfk0

---

Title: Language Models Are Better Than Humans at Next-token Prediction

Abstract: Current language models are considered to have sub-human capabilities at natural language tasks like question-answering or writing code. However, causal language models are not trained to perform well at these tasks; they are trained to accurately predict the next token given previous tokens in tokenized text. It is not clear whether language models are better or worse than humans at next-token prediction. To try to answer this question, we performed two distinct experiments to directly compare humans and language models on this front: one measuring top-1 accuracy and the other measuring perplexity on OpenWebText. In both experiments, we find humans to be consistently \emph{worse} than relatively small language models like GPT-Neo-1.3B or GPT-2-large at next-token prediction.
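
As a concrete reference for the two quantities compared in the paper, the sketch below computes next-token top-1 accuracy and perplexity with an off-the-shelf GPT-2 checkpoint from Hugging Face `transformers`; this is an illustrative setup, not the authors' evaluation code.

```python
# Minimal sketch: next-token top-1 accuracy and perplexity of GPT-2 on a toy sentence.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

text = "The quick brown fox jumps over the lazy dog."
ids = tok(text, return_tensors="pt").input_ids
with torch.no_grad():
    out = model(ids, labels=ids)                 # loss is mean next-token cross-entropy

logits, labels = out.logits[:, :-1], ids[:, 1:]  # predict token t+1 from tokens <= t
top1 = (logits.argmax(-1) == labels).float().mean().item()
print(f"top-1 accuracy: {top1:.2f}, perplexity: {out.loss.exp().item():.1f}")
```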

URL: https://openreview.net/forum?id=RNsnSLdmV7

---

Title: Contextual Vision Transformers for Robust Representation Learning

Abstract: We introduce Contextual Vision Transformers (ContextViT), a method designed to generate robust image representations for datasets experiencing shifts in latent factors across various groups. Derived from the concept of in-context learning, ContextViT incorporates an additional context token to encapsulate group-specific information. This integration allows the model to adjust the image representation in accordance with the group-specific context. Specifically, for a given input image, ContextViT maps images with identical group membership into this context token, which is appended to the input image tokens. Additionally, we introduce a context inference network to predict such tokens on-the-fly, given a batch of samples from the group. This enables ContextViT to adapt to new testing distributions during inference time. We demonstrate the efficacy of ContextViT across a wide range of applications. In supervised fine-tuning, we show that augmenting pre-trained ViTs with our proposed context conditioning mechanism results in consistent improvements in out-of-distribution generalization on iWildCam and FMoW. We also investigate self-supervised representation learning with ContextViT. Our experiments on the Camelyon17 pathology imaging benchmark and the JUMP-CP microscopy imaging benchmark demonstrate that ContextViT excels in learning stable image featurizations amidst distribution shift, consistently outperforming its ViT counterpart.
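
A minimal sketch of the context-token mechanism described above, under simplified assumptions (a fixed set of groups and a plain embedding table); the class and shapes are illustrative, not the authors' implementation.

```python
# Minimal sketch: a per-group context token is prepended to the patch tokens
# before they are fed to the transformer encoder.
import torch
import torch.nn as nn

class ContextToken(nn.Module):
    def __init__(self, n_groups: int, dim: int):
        super().__init__()
        self.context = nn.Embedding(n_groups, dim)    # one learned token per group

    def forward(self, patch_tokens: torch.Tensor, group_id: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, N, dim); group_id: (B,)
        ctx = self.context(group_id).unsqueeze(1)      # (B, 1, dim)
        return torch.cat([ctx, patch_tokens], dim=1)   # (B, N+1, dim) fed to the ViT

tokens = torch.randn(4, 196, 768)
print(ContextToken(n_groups=3, dim=768)(tokens, torch.tensor([0, 1, 2, 0])).shape)
```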

URL: https://openreview.net/forum?id=zg8EHhNJVp

---

Title: Textless Low-Resource Speech-to-Speech Translation With Unit Language Models

Abstract: Existing speech-to-speech translation models fall into two camps: textless models trained with hundreds of hours of parallel speech data or unsupervised models that leverage text as an intermediate step. Both approaches limit building speech-to-speech translation models for a wide range of languages, as they exclude languages that are primarily spoken and language pairs that lack large-scale parallel speech data. We present a new framework for training textless low-resource speech-to-speech translation (S2ST) systems that only need dozens of hours of parallel speech data. We reformulate S2ST as a unit-to-unit seq2seq translation task, and start by pretraining a model on large-scale monolingual speech data. Then, we finetune it with a small amount of parallel speech data ($20-60$ hours). Lastly, we improve model performance through an unsupervised backtranslation objective. We train and evaluate our models for English-to-German, German-to-English and Marathi-to-English translation on three different domains (European Parliament, Common Voice, and All India Radio) with single-speaker synthesized speech data. Evaluated using the ASR-BLEU metric, our models achieve reasonable performance on all three domains, with some being within 1-2 points of our supervised topline.

URL: https://openreview.net/forum?id=zTNVjQXZyx

---

Title: [Re] Plan, Verify and Switch: Integrated Reasoning with Diverse X-of-Thoughts

Abstract: In this work, we aim to reproduce the EMNLP 2023 paper of Liu et al. (2023) titled Plan, Verify and Switch: Integrated Reasoning with Diverse X-of-Thoughts. In the original paper, XoT - a prompting method which is designed to enhance LLMs' ability to deal with mathematical problems - is introduced and tested using GPT-3.5. The experiments showed that the technique outperforms the existing ones. We seek not only to ascertain the effectiveness of the method for a smaller model, the Phi-2, but also to expand the ideas of the paper by integrating metacognitive evaluation and broadening one of its modules.

URL: https://openreview.net/forum?id=k74hPZxIAz

---

Title: Fair Feature Importance Scores for Interpreting Decision Trees

Abstract: Across various sectors such as healthcare, criminal justice, national security, finance, and technology, large-scale machine learning (ML) systems are being deployed to make critical data-driven decisions. Many have asked if we can and should trust these ML systems to be making these decisions. Two critical components are prerequisites for trust in ML systems: interpretability, or the ability to understand why the ML system makes the decisions it does, and fairness, which ensures that ML systems do not exhibit bias against certain individuals or groups. While both interpretability and fairness have garnered substantial attention in the ML literature, methods directly interpreting models in terms of fairness remain limited. This paper considers a popular interpretation for a widely used class of ML models: feature importance scores for decision trees and tree-based models. We introduce a novel Fair Tree Feature Importance Score to assess each feature's impact on fairness or bias in decision trees. Analogous to the mean decrease in impurity for trees, our score quantifies the mean increase (or decrease) in group bias, and extends to interpret tree-based ensembles or surrogates of complex ML systems. Through simulations and real examples on benchmark fairness datasets, we show the validity of our Fair Tree Feature Importance Score, offering meaningful interpretations for both tree-based ensembles and tree-based surrogates of other ML systems.

URL: https://openreview.net/forum?id=72mDxlzRZ1

---

Title: Intriguing Properties of Modern GANs

Abstract: Modern GANs achieve remarkable performance in terms of generating realistic and diverse samples. This has led many to believe that "GANs capture the training data manifold". In this work we show that this interpretation is wrong. We empirically show that the manifold learned by modern GANs does not fit the training distribution: specifically the manifold does not pass through the training examples and passes closer to out-of-distribution images than to in-distribution images. We also investigate the distribution over images implied by the prior over the latent codes and study whether modern GANs learn a density that approximates the training distribution. Surprisingly, we find that the learned density is very far from the data distribution and that GANs tend to assign higher density to out-of-distribution images. Finally, we demonstrate that the set of images used to train modern GANs are often not part of the typical set described by the GANs' distribution.

URL: https://openreview.net/forum?id=XCZAokQ0c8

---

Title: Combine and Conquer: A Meta-Analysis on Data Shift and Out-of-Distribution Detection

Abstract: This paper introduces a universal approach to seamlessly combine out-of-distribution (OOD) detection scores. These scores creatively encompass a wide range of techniques that leverage the self-confidence of deep learning models and the anomalous behavior of features in the latent space. Not surprisingly, combining such a varied population using simple statistics proves inadequate. To overcome this challenge, we propose a quantile normalization to map these scores into p-values, effectively framing the problem into a multi-variate hypothesis test. Then, we combine these tests using established meta-analysis tools, resulting in a more effective detector with consolidated decision boundaries. Furthermore, we create a probabilistic interpretable criterion by mapping the final statistics into a distribution with known parameters. Through empirical investigation, we explore different types of shifts, each exerting varying degrees of impact on data. Our results demonstrate that our approach significantly improves overall robustness and performance across diverse OOD detection scenarios. Notably, our framework is easily extensible for future developments in detection scores and stands as the first to combine decision boundaries in this context.
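
The abstract above describes mapping detector scores to p-values and combining them with meta-analysis tools. The sketch below is a generic illustration of that idea using empirical quantiles and Fisher's method (one standard meta-analysis tool); the helper names and the toy detectors are assumptions, not the authors' implementation.

```python
# Minimal sketch: turn several OOD scores into p-values against in-distribution
# statistics, then combine them with Fisher's method.
import numpy as np
from scipy import stats

def to_p_values(scores_id: np.ndarray, scores_test: np.ndarray) -> np.ndarray:
    """Empirical right-tailed p-value of each test score under the ID score distribution."""
    # Assumes larger score = more anomalous; flip the inequality otherwise.
    return np.array([(scores_id >= s).mean() for s in scores_test]).clip(1e-6, 1.0)

def fisher_combine(p_matrix: np.ndarray) -> np.ndarray:
    """p_matrix: (n_detectors, n_samples). Returns one combined p-value per sample."""
    statistic = -2.0 * np.log(p_matrix).sum(axis=0)
    return stats.chi2.sf(statistic, df=2 * p_matrix.shape[0])

# Toy usage with two hypothetical detectors (e.g., max-softmax and a feature-space score).
rng = np.random.default_rng(0)
id_scores = [rng.normal(0, 1, 5000), rng.normal(0, 1, 5000)]
test_scores = [rng.normal(2, 1, 10), rng.normal(2, 1, 10)]    # shifted -> anomalous
p = np.stack([to_p_values(i, t) for i, t in zip(id_scores, test_scores)])
print(fisher_combine(p))    # small combined p-values flag likely OOD samples
```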

URL: https://openreview.net/forum?id=VGNBUS9TrU

---

Title: Object-Centric Relational Representations for Image Generation

Abstract: Conditioning image generation on specific features of the desired output is a key ingredient of modern generative models. However, existing approaches lack a general and unified way of representing structural and semantic conditioning at diverse granularity levels. This paper explores a novel method to condition image generation, based on object-centric relational representations. In particular, we propose a methodology to condition the generation of objects in an image on the attributed graph representing their structure and the associated semantic information. We show that such architectural biases entail properties that facilitate the manipulation and conditioning of the generative process and allow for regularizing the training procedure. The proposed conditioning framework is implemented by means of a neural network that learns to generate a 2D, multi-channel, layout mask of the objects, which can be used as a soft inductive bias in the downstream generative task. To do so, we leverage both 2D and graph convolutional operators. We also propose a novel benchmark for image generation consisting of a synthetic dataset of images paired with their relational representation. Empirical results show that the proposed approach compares favorably against relevant baselines.

URL: https://openreview.net/forum?id=7kWjB9zW90

---

Title: Appropriate Balance of Diversification and Intensification Improves Performance and Efficiency of Adversarial Attacks

Abstract: Recently, adversarial attacks that generate adversarial examples by optimizing a multimodal function with many local optimums have attracted considerable research attention. Quick convergence to a nearby local optimum (intensification) and fast enumeration of multiple different local optima (diversification) are important to construct strong attacks. Most existing white-box attacks that use the model’s gradient enumerate multiple local optima based on multi-restart; however, our experiments suggest that the ability of diversification based on multi-restart is limited. To tackle this problem, we propose the multi-directions/objectives (MDO) strategy, which uses multiple search directions and objective functions for diversification. Efficient Diversified Attack, a combination of MDO and multi-target strategies, showed further diversification performance, resulting in better performance than recently proposed attacks against around 88% of 41 CNN-based robust models and 100% of 10 more advanced models, including transformer-based architecture. These results suggest a relationship between attack performances and a balance of diversification and intensification, which is beneficial to constructing more potent attacks.

URL: https://openreview.net/forum?id=mK6TwmInTg

---

Title: Towards More Robust NLP System Evaluation: Handling Missing Scores in Benchmarks

Abstract: The evaluation of natural language processing (NLP) systems is crucial for advancing the field, but current benchmarking approaches often assume that all systems have scores available for all tasks, which is not always practical. In reality, several factors such as the cost of running baselines, private systems, computational limitations, or incomplete data may prevent some systems from being evaluated on entire tasks. This paper formalizes an existing problem in NLP research: benchmarking when some systems' scores are missing on a task, and proposes a novel approach to address it. Our method utilizes a compatible partial ranking approach to impute missing data, which is then aggregated using the Borda count method. It includes two refinements designed specifically for scenarios where either task-level or instance-level scores are available. We also introduce an extended benchmark, which contains over 131 million scores, an order of magnitude larger than existing benchmarks. We validate our methods and demonstrate their effectiveness in addressing the challenge of missing system evaluations on an entire task. This work highlights the need for more comprehensive benchmarking approaches that can handle real-world scenarios where not all systems are evaluated on the entire task.
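
To illustrate the aggregation step mentioned above, the sketch below applies a plain Borda count to per-task scores with missing entries; it deliberately skips the paper's compatible partial-ranking imputation and uses assumed inputs.

```python
# Minimal sketch: aggregate per-task rankings with a Borda count, simply skipping
# systems whose score is missing (np.nan) on a given task.
import numpy as np

def borda_aggregate(scores: np.ndarray) -> np.ndarray:
    """scores: (n_tasks, n_systems) with np.nan for missing entries. Higher is better."""
    n_tasks, n_systems = scores.shape
    points = np.zeros(n_systems)
    for task in scores:
        present = ~np.isnan(task)
        order = np.argsort(task[present])             # ascending: worst present system first
        # Borda points: 0 for the worst present system, up to k-1 for the best.
        points[np.flatnonzero(present)[order]] += np.arange(present.sum())
    return points

scores = np.array([[0.9, 0.7, np.nan],
                   [0.6, 0.8, 0.5]])
print(borda_aggregate(scores))    # higher total = better aggregated rank
```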

URL: https://openreview.net/forum?id=wNWcOMfNCn

---

Title: Choosing Public Datasets for Private Machine Learning via Gradient Subspace Distance

Abstract: Differentially private stochastic gradient descent privatizes model training by injecting noise into each iteration, where the noise magnitude increases with the number of model parameters. Recent works suggest that we can reduce the noise by leveraging public data for private machine learning, by projecting gradients onto a subspace prescribed by the public data. However, given a choice of public datasets, it is unclear why certain datasets perform better than others for a particular private task, or how to identify the best one. We provide a simple metric which measures a low-dimensional subspace distance between gradients of the public and private examples. We empirically demonstrate that it is well-correlated with resulting model utility when using the public and private dataset pair (i.e., trained model accuracy is monotone in the distance), and thus can be used to select an appropriate public dataset. We provide theoretical analysis demonstrating that the excess risk scales with this subspace distance. This distance is easy to compute and robust to modifications in the setting.
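
As a rough illustration of what a gradient subspace distance can look like, the sketch below compares the top-k subspaces of public and private per-example gradients via principal angles; this is a generic construction under our own assumptions, not necessarily the exact metric proposed in the paper.

```python
# Minimal sketch: projection (chordal) distance between the top-k gradient subspaces
# of a public and a private dataset.
import numpy as np

def top_k_basis(grads: np.ndarray, k: int) -> np.ndarray:
    """grads: (n_examples, n_params). Returns an orthonormal basis (n_params, k)."""
    _, _, vt = np.linalg.svd(grads, full_matrices=False)
    return vt[:k].T

def subspace_distance(g_public: np.ndarray, g_private: np.ndarray, k: int = 10) -> float:
    u, v = top_k_basis(g_public, k), top_k_basis(g_private, k)
    cosines = np.linalg.svd(u.T @ v, compute_uv=False)    # cosines of the principal angles
    return float(np.sqrt(k - np.sum(cosines ** 2)))        # 0 when the subspaces coincide

rng = np.random.default_rng(0)
print(subspace_distance(rng.normal(size=(256, 1000)), rng.normal(size=(256, 1000))))
```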

URL: https://openreview.net/forum?id=uLLpaiMTdD

---

Title: Bayesian learning of Causal Structure and Mechanisms with GFlowNets and Variational Bayes

Abstract: Bayesian causal structure learning aims to learn a posterior distribution over directed acyclic graphs (DAGs), and the mechanisms that define the relationship between parent and child variables. By taking a Bayesian approach, it is possible to reason about the uncertainty of the causal model. The notion of modelling the uncertainty over models is particularly crucial for causal structure learning since the model could be unidentifiable when given only a finite amount of observational data. In this paper, we introduce a novel method to jointly learn the structure and mechanisms of the causal model using Variational Bayes, which we call Variational Bayes-DAG-GFlowNet (VBG). We extend the method of Bayesian causal structure learning using GFlowNets to learn not only the posterior distribution over the structure, but also the parameters of a linear Gaussian model. Our results on simulated and real-world data suggest that VBG is competitive against several baselines in modelling the posterior over DAGs and mechanisms, while offering several advantages over existing methods which include guaranteed acyclicity of graphs and unlimited sampling from the posterior once the model is trained.

URL: https://openreview.net/forum?id=kF3KqzRTsi

---

Title: Reproducibility of "ITI-GEN: Inclusive Text-to-Image Generation"

Abstract: A major limitation of current text-to-image generation models is their inherent tendency to incorporate biases, thereby not demonstrating inclusivity in certain attributes. An approach to enhance the inclusiveness is suggested by Zhang et al. (2023): Inclusive Text-to-Image Generation (ITI-GEN). The authors state that ITI-GEN leverages reference images to improve the inclusiveness of text-to-image generation by learning inclusive prompt embeddings for targeted attributes. In this paper, the reproducibility of ITI-GEN is investigated in an attempt to validate the main claims presented by the authors. Moreover, additional experiments are conducted to provide further evidence supporting their assertions and to research their limitations. This concerns the research on inclusive prompt embeddings, the inclusivity of untargeted attributes, and the influence of the reference images. The results from the reproducibility study mainly show support for their claims. The additional experiments reveal that ITI-GEN only guarantees inclusivity for the specified targeted attributes. To address this shortcoming, we present a possible solution, namely ensuring a balanced reference dataset.

URL: https://openreview.net/forum?id=GO7Jg4HSAA

---

Title: Incorporating Unlabelled Data into Bayesian Neural Networks

Abstract: Conventional Bayesian Neural Networks (BNNs) are unable to leverage unlabelled data to improve their predictions. To overcome this limitation, we introduce Self-Supervised Bayesian Neural Networks, which use unlabelled data to learn models with suitable prior predictive distributions. This is achieved by leveraging contrastive pretraining techniques and optimising a variational lower bound. We then show that the prior predictive distributions of self-supervised BNNs capture problem semantics better than conventional BNN priors. In turn, our approach offers improved predictive performance over conventional BNNs, especially in low-budget regimes.

URL: https://openreview.net/forum?id=q2AbLOwmHm

---

Title: Gradient Scarcity in Graph Learning with Bilevel Optimization

Abstract: Gradient scarcity emerges when learning graphs by minimizing a loss on a subset of nodes under the semi-supervised setting. It manifests as edges between unlabeled nodes that are far from the labeled ones receiving zero gradients. The phenomenon was first described when jointly optimizing the graph and the weights of a shallow Graph Neural Network (GNN) using a single loss function. In this work, we give a precise mathematical characterization of this phenomenon, and prove that it also emerges in bilevel optimization. While for GNNs gradient scarcity occurs due to their finite receptive field, we show that it also occurs with Laplacian regularization, as gradients decrease exponentially in amplitude with distance to labeled nodes, despite the infinite receptive field of this model. We study several solutions to this issue including latent graph learning using a Graph-to-Graph model (G2G), graph regularization to impose a prior structure on the graph, and reducing the graph diameter by optimizing for a larger set of edges. Our empirical results validate our analysis and show that this issue also occurs with the Approximate Personalized Propagation of Neural Predictions (APPNP), which approximates a model of infinite receptive field.

URL: https://openreview.net/forum?id=10YJTIsVYq

---

Title: Deep Backtracking Counterfactuals for Causally Compliant Explanations

Abstract: Counterfactuals answer questions of what would have been observed under altered circumstances and can therefore offer valuable insights. Whereas the classical interventional interpretation of counterfactuals has been studied extensively, backtracking constitutes a less studied alternative where all causal laws are kept intact. In the present work, we introduce a practical method called deep backtracking counterfactuals (DeepBC) for computing backtracking counterfactuals in structural causal models that consist of deep generative components. We propose two distinct versions of our method—one utilizing Langevin Monte Carlo sampling and the other employing constrained optimization—to generate counterfactuals for high-dimensional data. As a special case, our formulation reduces to methods in the field of counterfactual explanations. Compared to these, our approach represents a causally compliant, versatile and modular alternative. We demonstrate these properties experimentally on a modified version of MNIST and CelebA.

URL: https://openreview.net/forum?id=Br5esc2CXR

---

Title: Universal Neurons in GPT2 Language Models

Abstract: A basic question within the emerging field of mechanistic interpretability is the degree to which neural networks learn the same underlying mechanisms. In other words, are neural mechanisms universal across different models?
In this work, we study the universality of individual neurons across GPT2 models trained from different initial random seeds, motivated by the hypothesis that universal neurons are likely to be interpretable. In particular, we compute pairwise correlations of neuron activations over 100 million tokens for every neuron pair across five different seeds and find that 1-5\% of neurons are universal, that is, pairs of neurons which consistently activate on the same inputs. We then study these universal neurons in detail, finding that they usually have clear interpretations and taxonomize them into a small number of neuron families. We conclude by studying patterns in neuron weights to establish several universal functional roles of neurons in simple circuits: deactivating attention heads, changing the entropy of the next token distribution, and predicting the next token to (not) be within a particular set.
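
A small sketch of the correlation computation (shapes and thresholds are illustrative, not the paper's pipeline): given activation matrices collected from two models over the same token stream, compute the Pearson correlation between every neuron pair and inspect each neuron's best match:

    import numpy as np

    def cross_model_correlations(acts_a, acts_b):
        """Pearson correlation between every neuron in model A and model B,
        computed over a shared token stream.
        acts_a: (n_tokens, n_neurons_a), acts_b: (n_tokens, n_neurons_b)."""
        a = (acts_a - acts_a.mean(0)) / (acts_a.std(0) + 1e-8)
        b = (acts_b - acts_b.mean(0)) / (acts_b.std(0) + 1e-8)
        return (a.T @ b) / acts_a.shape[0]             # (n_neurons_a, n_neurons_b)

    rng = np.random.default_rng(0)
    acts_a = rng.normal(size=(10_000, 512))            # stand-ins for MLP activations
    acts_b = rng.normal(size=(10_000, 512))
    best_match = cross_model_correlations(acts_a, acts_b).max(axis=1)
    print("fraction of neurons with a strong cross-seed match:", (best_match > 0.5).mean())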

URL: https://openreview.net/forum?id=ZeI104QZ8I

---

Title: Data Attribution for Diffusion Models: Timestep-induced Bias in Influence Estimation

Abstract: Data attribution methods trace model behavior back to its training dataset, offering an effective approach to better understand ``black-box'' neural networks. While prior research has established quantifiable links between model output and training data in diverse settings, interpreting diffusion model outputs in relation to training samples remains underexplored. In particular, diffusion models operate over a sequence of timesteps instead of the instantaneous input-output relationships of previous contexts, posing a significant challenge to extending existing frameworks to diffusion models directly. Notably, we present Diffusion-TracIn, which incorporates these temporal dynamics, and observe that samples' loss gradient norms are highly dependent on the timestep. This trend leads to a prominent bias in influence estimation, and is particularly noticeable for samples trained on large-norm-inducing timesteps, causing them to be generally influential. To mitigate this effect, we introduce Diffusion-ReTrac as a re-normalized adaptation that enables the retrieval of training samples more targeted to the test sample of interest, facilitating a localized measurement of influence and considerably more intuitive visualization. We demonstrate the efficacy of our approach through various evaluation metrics and auxiliary tasks, reducing the number of generally influential samples to $\frac{1}{3}$ of the original quantity.

URL: https://openreview.net/forum?id=P3Lyun7CZs

---

Title: Towards Minimal Targeted Updates of Language Models with Targeted Negative Training

Abstract: Generative models of language exhibit impressive capabilities but still place non-negligible probability mass over undesirable outputs. In this work, we address the task of updating a model to avoid unwanted outputs while minimally changing model behavior otherwise, a challenge we refer to as a minimal targeted update. We first formalize the notion of a minimal targeted update and propose a method to achieve such updates using negative examples from a model's generations. Our proposed Targeted Negative Training (TNT) results in updates that keep the new distribution close to the original, unlike existing losses for negative signal which push down probability but do not control what the updated distribution will be. In experiments, we demonstrate that TNT yields a better trade-off between reducing unwanted behavior and maintaining model generation behavior than baselines, paving the way towards a modeling paradigm based on iterative training updates that constrain models from generating undesirable outputs while preserving their impressive capabilities.

URL: https://openreview.net/forum?id=lrZ2yiqOS2

---

Title: TRIDENT: The Nonlinear Trilogy for Implicit Neural Representations

Abstract: Implicit neural representations (INRs) have garnered significant interest recently for their ability to model complex, high-dimensional data without explicit parameterisation. In this work, we introduce TRIDENT, a novel function for implicit neural representations characterised by a trilogy of nonlinearities. Firstly, it is designed to represent high-order features through order compactness. Secondly, TRIDENT efficiently captures frequency information, a feature called frequency compactness. Thirdly, it has the capability to represent signals or images such that most of their energy is concentrated in a limited spatial region, denoting spatial compactness. We demonstrate through extensive experiments on various inverse problems that our proposed function outperforms existing implicit neural representation functions.

URL: https://openreview.net/forum?id=OmUNBXry91

---

Title: Ranking evaluation metrics from a group-theoretic perspective

Abstract: When identifying the most suitable metric to validate the merits of newly proposed models, the decision-making process is anything but straightforward. Comparing rankings introduces its own set of formidable challenges, and there is likely no universal metric applicable to all scenarios. Furthermore, metrics designed for specific contexts, such as for Recommender Systems, sometimes extend to other domains without a comprehensive grasp of their underlying mechanisms, resulting in unforeseen outcomes and potential misuses. Complicating matters further, distinct metrics may emphasize different aspects of rankings, frequently leading to seemingly contradictory comparisons of model results and hindering the trustworthiness of evaluations.

We unveil these aspects in the domain of ranking evaluation metrics. First, we show instances resulting in inconsistent evaluations, a source of potential mistrust in commonly used metrics; by quantifying the frequency of such disagreements, we prove that these disagreements are common in rankings. Afterward, we conceptualize rankings using the mathematical formalism of symmetric groups, detached from the particular domains in which the metrics were created; through this approach, we can rigorously and formally establish essential mathematical properties for ranking evaluation metrics, which are essential for a deeper comprehension of the sources of inconsistent evaluations. We conclude with a discussion connecting our theoretical analysis to practical applications, highlighting which properties are important in each domain where rankings are commonly evaluated. Overall, our analysis sheds light on ranking evaluation metrics, highlighting that inconsistent evaluations should not be seen as a source of mistrust but as a reminder to choose carefully how we evaluate our models in the future.

URL: https://openreview.net/forum?id=8jJG59Zq1e

---

Title: Approximations to the Fisher Information Metric of Deep Generative Models for Out-Of-Distribution Detection

Abstract: Likelihood-based deep generative models such as score-based diffusion models and variational autoencoders are state-of-the-art machine learning models approximating high-dimensional distributions of data such as images, text, or audio. One of many downstream tasks they can be naturally applied to is out-of-distribution (OOD) detection. However, seminal work by Nalisnick et al. which we reproduce showed that deep generative models consistently infer higher log-likelihoods for OOD data than data they were trained on, marking an open problem. In this work, we analyse using the gradient of a data point with respect to the parameters of the deep generative model for OOD detection, based on the simple intuition that OOD data should have larger gradient norms than training data. We formalise measuring the size of the gradient as approximating the Fisher information metric. We show that the Fisher information matrix (FIM) has large absolute diagonal values, motivating the use of chi-square distributed, layer-wise gradient norms as features. We combine these features to make a simple, model-agnostic and hyperparameter-free method for OOD detection which estimates the joint density of the layer-wise gradient norms for a given data point. We find that these layer-wise gradient norms are weakly correlated, rendering their combined usage informative, and prove that the layer-wise gradient norms satisfy the principle of (data representation) invariance. Our empirical results indicate that this method outperforms the Typicality test for most deep generative models and image dataset pairings.

URL: https://openreview.net/forum?id=EcuwtinFs9

---

Title: Bayesian Extreme Learning

Abstract: This paper introduces a Bayesian extreme learning (BEL) model for analyzing high-dimensional datasets characterized by extreme values. The model synthesizes elements from information theory, Bayesian inference, machine learning, and extreme value theory. Convergence properties of the BEL model are established by showing that the Kullback-Leibler divergence between consecutive posterior distributions declines as the sample size grows. The model’s capability to isolate extreme values is demonstrated by increasing entropy. Additionally, the paper validates the regularization optimality, where the optimal parameter configuration effectively minimizes the divergence from a specified reference distribution. The paper also shows the model’s proficiency in achieving near-optimal information extraction and its universal approximation ability for continuous extreme value distributions across a range of tolerance levels. The model’s robustness and versatility are illustrated through examples, simulations, and applications, underscoring its potential utility in statistical learning within high-dimensional datasets.

URL: https://openreview.net/forum?id=kLuUqT4BnO

---

Title: Variational Bayesian Imaging with an Efficient Surrogate Score-based Prior

Abstract: We propose a surrogate function for efficient yet principled use of score-based priors in Bayesian imaging. Recent work turned score-based diffusion models into principled priors for solving ill-posed imaging problems by appealing to an ODE-based log-probability function. However, evaluating the ODE is computationally inefficient and inhibits posterior estimation of high-dimensional images. Our proposed surrogate prior is based on the evidence lower bound of a score-based diffusion model. We demonstrate the surrogate prior on variational inference for efficient approximate posterior sampling of large images. Compared to the exact prior in previous work, our surrogate accelerates optimization of the variational image distribution by at least two orders of magnitude. We also find that our principled approach gives more-accurate posterior estimation than non-variational diffusion-based approaches that involve hyperparameter-tuning at inference. Our work establishes a practical path forward for using score-based diffusion models as general-purpose image priors.

URL: https://openreview.net/forum?id=db2pFKVcm1

---

Title: Revisiting Non-separable Binary Classification and its Applications in Anomaly Detection

Abstract: The inability to linearly classify $\texttt{XOR}$ has motivated much of deep learning.
We revisit this age-old problem and show that $\textit{linear}$ classification of $\texttt{XOR}$ is indeed possible.
Instead of separating data between halfspaces, we propose a slightly different paradigm, $\texttt{equality separation}$, that adapts the SVM objective to distinguish data within or outside the margin.
Our classifier can then be integrated into neural network pipelines with a smooth approximation.
From its properties, we intuit that equality separation is suitable for anomaly detection.
To formalize this notion, we introduce $\textit{closing numbers}$, a quantitative measure on the capacity for classifiers to form closed decision regions for anomaly detection.
Springboarding from this theoretical connection between binary classification and anomaly detection, we test our hypothesis on supervised anomaly detection experiments, showing that equality separation can detect both seen and unseen anomalies.
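
The XOR claim is easy to verify concretely; a minimal illustration of equality separation (not the full SVM-style objective) with a hand-picked hyperplane:

    import numpy as np

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    y = np.array([0, 1, 1, 0])                        # XOR labels

    w, b, margin = np.array([1.0, -1.0]), 0.0, 0.5
    pred = (np.abs(X @ w + b) > margin).astype(int)   # outside the margin -> class 1
    print(pred, bool((pred == y).all()))              # [0 1 1 0] True

Points on (or near) the hyperplane w.x + b = 0 form one class and points away from it form the other, which is exactly what a single halfspace separator cannot express for XOR.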

URL: https://openreview.net/forum?id=zOJ846BXhl

---

Title: Mildly Overparameterized ReLU Networks Have a Favorable Loss Landscape

Abstract: We study the loss landscape of both shallow and deep, mildly overparameterized ReLU neural networks on a generic finite input dataset for the squared error loss. We show both by count and volume that most activation patterns correspond to parameter regions with no bad local minima. Furthermore, for one-dimensional input data, we show most activation regions realizable by the network contain a high dimensional set of global minima and no bad local minima. We experimentally confirm these results by finding a phase transition from most regions having full rank Jacobian to many regions having deficient rank depending on the amount of overparameterization.

URL: https://openreview.net/forum?id=10WARaIwFn

---

Title: Imprecise Label Learning: A Unified Framework for Learning with Various Imprecise Label Configurations

Abstract: Learning with reduced labeling standards, such as noisy label, partial label, and supplementary unlabeled data, which we generically refer to as imprecise label, is a commonplace challenge in machine learning tasks. Previous methods tend to propose specific designs for every emerging imprecise label configuration, which is usually unsustainable when multiple configurations of imprecision coexist.
In this paper, we introduce imprecise label learning (ILL), a framework for the unification of learning with various imprecise label configurations. ILL leverages expectation-maximization (EM) for modeling the imprecise label information, treating the precise labels as latent variables. Instead of approximating the correct labels for training, it considers the entire distribution of all possible labeling entailed by the imprecise information. We demonstrate that ILL can seamlessly adapt to partial label learning, semi-supervised learning, noisy label learning, and, more importantly, a mixture of these settings, with closed-form learning objectives derived from the unified EM modeling. Notably, ILL surpasses the existing specified techniques for handling imprecise labels, marking the first practical and unified framework with robust and effective performance across various challenging settings. We hope our work will inspire further research on this topic, unleashing the full potential of ILL in wider scenarios where precise labels are expensive and complicated to obtain.
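
A minimal sketch of the EM idea for one imprecise-label configuration (partial labels), with hypothetical shapes and only the E-step plus the resulting objective; the paper derives closed-form objectives for this and other settings:

    import numpy as np

    def e_step(probs, candidate_mask):
        """Posterior over the latent true label, restricted to each example's
        candidate set. probs: (n, c) model probabilities; candidate_mask: (n, c) in {0,1}."""
        masked = probs * candidate_mask
        return masked / masked.sum(axis=1, keepdims=True)

    def m_step_loss(log_probs, soft_targets):
        """Cross-entropy against the E-step posterior, minimized over model parameters."""
        return -(soft_targets * log_probs).sum(axis=1).mean()

    probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
    mask = np.array([[1, 1, 0], [0, 1, 1]])           # candidate label sets
    q = e_step(probs, mask)
    print(q, m_step_loss(np.log(probs), q))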

URL: https://openreview.net/forum?id=1LRrnGu7fA

---

Title: Pushing the Limits of Gradient Descent for Efficient Learning on Large Images

Abstract: Traditional CNN models are trained and tested on relatively low resolution images ($<300$ px), and cannot be directly applied to large-scale images due to compute and memory constraints. We propose Patch Gradient Descent (PatchGD), an effective learning strategy that makes it possible to train existing CNN architectures on large-scale images in an end-to-end manner. PatchGD is based on the hypothesis that instead of performing gradient-based updates on an entire image at once, it should be possible to achieve a good solution by performing model updates on only small parts of the image at a time, ensuring that the majority of it is covered over the course of iterations. PatchGD thus enjoys substantially better memory and compute efficiency when training models on large-scale images. PatchGD is thoroughly evaluated on two datasets, PANDA and UltraMNIST, with ResNet50 and MobileNetV2 models under different memory constraints. Our evaluation clearly shows that PatchGD is much more stable and efficient than the standard gradient-descent method in handling large images, especially when compute memory is limited.

URL: https://openreview.net/forum?id=6dS1jhdemD

---

Title: A Lennard-Jones Layer for Distribution Normalization

Abstract: We introduce the Lennard-Jones layer (LJL) for the equalization of the density of 2D and 3D point clouds through systematically rearranging points without destroying their overall structure (distribution normalization). LJL simulates a dissipative process of repulsive and weakly attractive interactions between individual points by considering the nearest neighbor of each point at a given moment in time. This pushes the particles into a potential valley, reaching a well-defined stable configuration that approximates an equidistant sampling after the stabilization process. We apply LJLs to redistribute randomly generated point clouds into a randomized uniform distribution. Moreover, LJLs are embedded in the generation process of point cloud networks by adding them at later stages of the inference process. The improvements in 3D point cloud generation utilizing LJLs are evaluated qualitatively and quantitatively. Finally, we apply LJLs to improve the point distribution of a score-based 3D point cloud denoising network. In general, we demonstrate that LJLs are effective for distribution normalization which can be applied at negligible cost without retraining the given neural network.
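
A toy nearest-neighbor version of the update (parameter values and the clipping are illustrative, not the paper's configuration): each point is nudged along the 12-6 Lennard-Jones force from its current nearest neighbor, with an overdamped step standing in for the dissipative dynamics:

    import numpy as np

    def lj_force(r_vec, sigma=0.05, eps=1.0):
        """Force from a neighbor at displacement r_vec: repulsive up close,
        weakly attractive farther out (12-6 Lennard-Jones potential)."""
        r = np.linalg.norm(r_vec) + 1e-12
        mag = 24 * eps * (2 * (sigma / r) ** 12 - (sigma / r) ** 6) / r
        mag = np.clip(mag, -1e3, 1e3)                  # keep the toy integrator stable
        return mag * (r_vec / r)

    def lj_layer(points, steps=50, dt=1e-5):
        pts = points.copy()
        for _ in range(steps):
            dists = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
            np.fill_diagonal(dists, np.inf)
            nn = dists.argmin(axis=1)                  # nearest neighbor of each point
            forces = np.stack([lj_force(pts[i] - pts[nn[i]]) for i in range(len(pts))])
            pts += dt * forces                         # overdamped (dissipative) update
        return pts

    pts = np.random.default_rng(0).random((256, 2))
    print(lj_layer(pts).shape)                         # (256, 2), points spread more evenly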

URL: https://openreview.net/forum?id=imGl7xItqQ

---

Title: End-to-End Training Induces Information Bottleneck through Layer-Role Differentiation: A Comparative Analysis with Layer-wise Training

Abstract: End-to-end (E2E) training, optimizing the entire model through error backpropagation, fundamentally supports the advancements of deep learning. Despite its high performance, E2E training faces the problems of memory consumption, parallel computing, and discrepancy with the functionalities of the actual brain. Various alternative methods have been proposed to overcome these difficulties; however, none of them can yet match the performance of E2E training, thereby falling short in practicality. Furthermore, there is no deep understanding of the differences in trained model properties beyond the performance gap.
In this paper, we reconsider why E2E training demonstrates a superior performance through a comparison with layer-wise training, which shares fundamental learning principles and architectures with E2E training, with the granularity of loss evaluation being the only difference. On the basis of the observation that E2E training has an advantage in propagating input information, we analyze the information plane dynamics of intermediate representations based on the Hilbert-Schmidt independence criterion (HSIC). The results of our normalized HSIC value analysis reveal the E2E training ability to exhibit different information dynamics across layers, in addition to efficient information propagation. Furthermore, we show that this layer-role differentiation leads to the final representation following the information bottleneck principle. Our work not only provides the advantages of E2E training in terms of information propagation and the information bottleneck but also suggests the need to consider the cooperative interactions between layers, not just the final layer when analyzing the information bottleneck of deep learning.
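
For reference, a compact version of the normalized HSIC statistic underlying this analysis (biased estimator with RBF kernels; the kernel bandwidth and sizes are illustrative):

    import numpy as np

    def rbf_kernel(x, gamma=1.0):
        sq = ((x[:, None] - x[None, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)

    def hsic(k, l):
        """Biased HSIC estimator from precomputed kernel matrices."""
        n = k.shape[0]
        h = np.eye(n) - np.ones((n, n)) / n
        return np.trace(k @ h @ l @ h) / (n - 1) ** 2

    def normalized_hsic(x, y, gamma=1.0):
        k, l = rbf_kernel(x, gamma), rbf_kernel(y, gamma)
        return hsic(k, l) / np.sqrt(hsic(k, k) * hsic(l, l))

    rng = np.random.default_rng(0)
    x = rng.normal(size=(200, 32))                     # e.g. inputs
    z = x @ rng.normal(size=(32, 16))                  # e.g. an intermediate representation
    print(normalized_hsic(x, z), normalized_hsic(x, rng.normal(size=(200, 16))))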

URL: https://openreview.net/forum?id=O3wmRh2SfT

---

Title: Overcoming Order in Autoregressive Graph Generation for Molecule Generation

Abstract: Graph generation is a fundamental problem in various domains, and is of particular interest in chemistry where graphs may be used to represent molecules. Recent work has shown that molecular graph generation using recurrent neural networks (RNNs) is advantageous compared to traditional generative approaches which require converting continuous latent representations into graphs. One issue which arises when treating graph generation as sequential generation is the arbitrary order of the sequence which results from a particular choice of graph flattening method: in the chemistry setting, molecular graphs commonly have multiple SMILES strings corresponding to the same molecule. Inspired by the use case of molecular graph generation, we propose using RNNs, taking into account the non-sequential nature of graphs by adding an Orderless Regularization (OLR) term that encourages the hidden state of the recurrent model to be invariant to different valid orderings present under the training distribution. We demonstrate that sequential molecular graph generation models benefit from our proposed regularization scheme, especially when data is scarce. Our findings contribute to the growing body of research on graph generation and provide a valuable tool for various applications requiring the synthesis of realistic and diverse graph structures.

URL: https://openreview.net/forum?id=BK6Gc10tRy

---

Title: Fine-tuning can cripple your foundation model; preserving features may be the solution

Abstract: Pre-trained foundation models, due to their enormous capacity and exposure to vast amounts of data during pre-training, are known to have learned plenty of real-world concepts. An important step in making these pre-trained models extremely effective on downstream tasks is to fine-tune them on related datasets. While various fine-tuning methods have been devised and have been shown to be highly effective, we observe that a fine-tuned model's ability to recognize concepts on tasks different from the downstream one is reduced significantly compared to its pre-trained counterpart. This is an undesirable effect of fine-tuning as a substantial amount of resources was used to learn these pre-trained concepts in the first place. We call this phenomenon ``concept forgetting'' and via experiments show that most end-to-end fine-tuning approaches suffer heavily from this side effect. To address this, we propose a simple fix by designing a new fine-tuning method called LDIFS (short for $\ell_2$ distance in feature space) that, while learning new concepts related to the downstream task, allows a model to preserve its pre-trained knowledge as well. Through extensive experiments on 10 fine-tuning tasks we show that LDIFS significantly reduces concept forgetting. Additionally, we show that LDIFS is highly effective in performing continual fine-tuning on a sequence of tasks, in comparison with both fine-tuning and continual learning baselines.
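
The core regularizer is simple to state; a minimal PyTorch-style sketch (the attribute names `features` and `head` and the single-layer penalty are illustrative simplifications of the method described above):

    import torch
    import torch.nn.functional as F

    def ldifs_style_loss(model, frozen_pretrained, x, y, lam=1.0):
        """Task loss plus an L2 penalty on drift from the pre-trained features."""
        feats = model.features(x)
        with torch.no_grad():
            feats_pre = frozen_pretrained.features(x)  # frozen pre-trained copy
        task = F.cross_entropy(model.head(feats), y)
        drift = ((feats - feats_pre) ** 2).mean()      # l2 distance in feature space
        return task + lam * drift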

URL: https://openreview.net/forum?id=kfhoeZCeW7

---

Title: Targeted Active Learning for Bayesian Decision-Making

Abstract: Active learning is usually applied to acquire labels of informative data points in supervised learning, to maximize accuracy in a sample-efficient way. However, maximizing the supervised learning accuracy is not the end goal when the results are used for decision-making, for example in personalized medicine or economics. We argue that when acquiring samples sequentially, the common practice of separating learning and decision-making is sub-optimal, and we introduce an active learning strategy that takes the down-the-line decision problem into account. Specifically, we adopt a Bayesian experimental design approach, in which the proposed acquisition criterion maximizes the expected information gain on the posterior distribution of the optimal decision. We compare our targeted active learning strategy to existing alternatives on both simulated and real data and show improved performance in decision-making accuracy.

URL: https://openreview.net/forum?id=KxPjuiMgmm

---

Title: Todyformer: Towards Holistic Dynamic Graph Transformers with Structure-Aware Tokenization

Abstract: Temporal Graph Neural Networks have garnered substantial attention for their capacity to model evolving structural and temporal patterns while exhibiting impressive performance. However, it is known that these architectures are encumbered by issues that constrain their performance, such as over-squashing and over-smoothing. Meanwhile, Transformers have demonstrated exceptional computational capacity to effectively address challenges related to long-range dependencies. Consequently, we introduce Todyformer—a novel Transformer-based neural network tailored for dynamic graphs. It unifies the local encoding capacity of Message-Passing Neural Networks (MPNNs) with the global encoding of Transformers through i) a novel patchifying paradigm for dynamic graphs to improve over-squashing, ii) a structure-aware parametric tokenization strategy leveraging MPNNs, iii) a Transformer with temporal positional-encoding to capture long-range dependencies, and iv) an encoding architecture that alternates between local and global contextualization, mitigating over-smoothing in MPNNs. Experimental evaluations on public benchmark datasets demonstrate that Todyformer consistently outperforms the state-of-the-art methods for downstream tasks. Furthermore, we illustrate the underlying aspects of the proposed model in effectively capturing extensive temporal dependencies in dynamic graphs.

URL: https://openreview.net/forum?id=nAQSUqEspb

---

Title: Multimodal Chain-of-Thought Reasoning in Language Models

Abstract: Large language models (LLMs) have shown impressive performance on complex reasoning by leveraging chain-of-thought (CoT) prompting to generate intermediate reasoning chains as the rationale to infer the answer. However, existing CoT studies have primarily focused on the language modality. We propose Multimodal-CoT that incorporates language (text) and vision (images) modalities into a two-stage framework that separates rationale generation and answer inference. In this way, answer inference can leverage better generated rationales that are based on multimodal information. Experimental results on ScienceQA and A-OKVQA benchmark datasets show the effectiveness of our proposed approach. With Multimodal-CoT, our model under 1 billion parameters achieves state-of-the-art performance on the ScienceQA benchmark. Our analysis indicates that Multimodal-CoT offers the advantages of mitigating hallucination and enhancing convergence speed. Code is publicly available at \texttt{Anonymous}.

URL: https://openreview.net/forum?id=y1pPWFVfvR

---

Title: Learning Network Granger causality using Graph Prior Knowledge

Abstract: Understanding the relationships among multiple entities through Granger causality graphs within multivariate time series data is crucial across various domains, including economics, finance, neurosciences, and genetics. Despite its broad utility, accurately estimating Granger causality graphs in high-dimensional scenarios with few samples remains a persistent challenge. In response, this study introduces a novel model that leverages prior knowledge in the form of a noisy undirected graph to facilitate the learning of Granger causality graphs, while assuming sparsity. In this study, we introduce an optimization problem, propose to solve it with an alternating minimization approach, and prove the convergence of our fitting algorithm, highlighting its effectiveness. Furthermore, we present experimental results derived from both synthetic and real-world datasets. These results clearly illustrate the advantages of our proposed method over existing alternatives, particularly in situations where few samples are available. By incorporating prior knowledge and emphasizing sparsity, our approach offers a promising solution to the complex problem of estimating Granger causality graphs in high-dimensional, data-scarce environments.

URL: https://openreview.net/forum?id=DN6sut5fyR

---

Title: Bit-by-Bit: Investigating the Vulnerabilities of Binary Neural Networks to Adversarial Bit Flipping

Abstract: Binary Neural Networks (BNNs), operating with ultra-low precision weights, incur a significant reduction in storage and compute cost compared to traditional Deep Neural Networks (DNNs). However, the vulnerability of such models to various hardware attacks is yet to be fully unveiled. Towards understanding the potential threat imposed on such highly efficient models, in this paper we explore a novel adversarial attack paradigm pertaining to BNNs. Specifically, we assume the attack is executed during the deployment phase, prior to inference, with malicious intent, via manipulation of accessible network parameters. We aim to accomplish a graceless degradation in BNN accuracy to a point where the fully functional network behaves, at best, as a random output generator, thus subverting confidence in the system. To this end, we propose an Outlier Gradient-based Evolutionary (OGE) attack that learns to inject a minimal number of critical bit flips into the pre-trained binary network weights, to introduce classification errors in the inference execution. To the best of our knowledge, this is the first work that leverages the outlier gradient weights to orchestrate a hardware-based bit-flip attack that is highly effective against the typically resilient low-quantization BNNs. Exhaustive evaluations on popular image recognition datasets including Fashion-MNIST, CIFAR10, GTSRB, and ImageNet demonstrate that OGE can cause mis-classification of up to 68.1% of the test images by flipping as few as 150 binary weights, out of 10.3 million in a BNN architecture.

URL: https://openreview.net/forum?id=nB8foAclpo

---

Title: Demonstrating and Reducing Shortcuts in Vision-Language Representation Learning

Abstract: Vision-language models (VLMs) mainly rely on contrastive training to learn general-purpose representations of images and captions. We focus on the situation when one image is associated with several captions, each caption containing both information shared among all captions and unique information per caption about the scene depicted in the image. In such cases, it is unclear whether contrastive losses are sufficient for learning task-optimal representations that contain all the information provided by the captions or whether the contrastive learning setup encourages the learning of a simple shortcut that minimizes contrastive loss. We introduce synthetic shortcuts for vision-language: a training and evaluation framework where we inject synthetic shortcuts into image-text data. We show that contrastive VLMs trained from scratch or fine-tuned with data containing these synthetic shortcuts mainly learn features that represent the shortcut. Hence, contrastive losses are not sufficient to learn task-optimal representations, i.e., representations that contain all task-relevant information shared between the image and associated captions. We examine two methods to reduce shortcut learning in our training and evaluation framework: (i) latent target decoding and (ii) implicit feature modification. We show empirically that both methods improve performance on the evaluation task, but only partly reduce shortcut learning when training and evaluating with our shortcut learning framework. Hence, we show the difficulty and challenge of our shortcut learning framework for contrastive vision-language representation learning.

URL: https://openreview.net/forum?id=gfANevPraH

---

Title: Intriguing Properties of Hyperbolic Embeddings in Vision-Language Models

Abstract: Vision-language models have in short time been established as powerful networks, demonstrating strong performance on a wide range of downstream tasks. A key factor behind their success is the learning of a joint embedding space where pairs of images and textual descriptions are contrastively aligned. Recent work has explored the geometry of the joint embedding space, finding that hyperbolic embeddings provide a compelling alternative to the commonly used Euclidean embeddings. Specifically, hyperbolic embeddings yield improved zero-shot generalization, better visual recognition, and more consistent semantic interpretations. In this paper, we conduct a deeper study into the hyperbolic embeddings and find that they open new doors for vision-language models. In particular, we find that hyperbolic vision-language models provide spatial awareness that Euclidean vision-language models lack, are better capable of dealing with ambiguity, and effectively discriminate between distributions. Our findings shed light on the greater potential of hyperbolic embeddings in large-scale settings, reaching beyond conventional down-stream tasks.

URL: https://openreview.net/forum?id=P5D2gfi4Gg

---

Title: The Disagreement Problem in Explainable Machine Learning: A Practitioner’s Perspective

Abstract: As various post hoc explanation methods are increasingly being leveraged to explain complex models in high-stakes settings, it becomes critical to develop a deeper understanding of whether and when the explanations output by these methods disagree with each other, and how such disagreements are resolved in practice. However, there is little to no research that provides answers to these critical questions. In this work, we introduce and study the disagreement problem in explainable machine learning. More specifically, we formalize the notion of disagreement between explanations, analyze how often such disagreements occur in practice, and how practitioners resolve these disagreements. To this end, we first conduct interviews with data scientists to understand what constitutes disagreement between explanations generated by different methods for the same model prediction, and introduce a novel quantitative framework to formalize this understanding. We then leverage this framework to carry out a rigorous empirical analysis with four real-world datasets, six state-of-the-art post hoc explanation methods, and eight different predictive models, to measure the extent of disagreement between the explanations generated by various popular explanation methods. In addition, we carry out an online user study with data scientists to understand how they resolve the aforementioned disagreements. Our results indicate that (1) state-of-the-art explanation methods often disagree in terms of the explanations they output, and (2) machine learning practitioners often employ ad hoc heuristics when resolving such disagreements. These findings suggest that practitioners may be relying on misleading explanations when making consequential decisions. They also underscore the importance of developing principled frameworks for effectively evaluating and comparing explanations output by various state-of-the-art methods.

URL: https://openreview.net/forum?id=jESY2WTZCe

---

Title: Self-attention-based Diffusion Model for Time-series Imputation in Partial Blackout Scenarios

Abstract: Missing values are a common phenomenon in multivariate time series data, capable of harming the performance of machine learning models and introducing bias and inaccuracies into further analysis. These gaps typically arise from various sources, including sensor malfunctions, extreme events like blackouts, and human error. Previous works have made promising strides in imputation for time series data. However, they mostly dealt with selected cases of missing patterns, such as missing at random, missing due to complete blackout (all features are missing for a given period of time), and forecasting. In this paper, we delve into a more general category of missing patterns, which we call \textbf{partial blackout}, wherein a subset of features remain missing for one or several consecutive time steps. This describes a more natural scenario that is frequently encountered in real-world applications and covers the aforementioned patterns as special cases. We introduce a two-stage imputation process that explicitly models the feature and temporal correlations with the help of self-attention and diffusion processes. Notably, our model outperforms the state-of-the-art models when dealing with general partial blackout scenarios and exhibits greater scalability, offering promise for practical data imputation needs. The code and the synthetic experiments are here: \hyperref[https://anonymous.4open.science/r/SADI-official-repository-3853/README.md]{https://anonymous.4open.science/r/SADI-official-repository-3853/README.md}.
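
For concreteness, a small helper showing what a partial-blackout missingness mask looks like (a random subset of features missing over one consecutive window; shapes and parameters are illustrative, not the paper's benchmark protocol):

    import numpy as np

    def partial_blackout_mask(n_steps, n_features, window=10, n_missing=3, seed=0):
        """Returns an (n_steps, n_features) mask: 1 = observed, 0 = missing.
        A random subset of features is missing for a consecutive block of steps."""
        rng = np.random.default_rng(seed)
        mask = np.ones((n_steps, n_features), dtype=int)
        start = rng.integers(0, n_steps - window)
        feats = rng.choice(n_features, size=n_missing, replace=False)
        mask[start:start + window, feats] = 0
        return mask

    mask = partial_blackout_mask(n_steps=50, n_features=8)
    print(mask.sum(), "of", mask.size, "entries observed")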

URL: https://openreview.net/forum?id=79AtAA2bVD

---

Title: Towards Understanding Label Smoothing

Abstract: Label smoothing regularization (LSR) has had great success in training deep neural networks by stochastic algorithms such as stochastic gradient descent and its variants. However, the theoretical understanding of its power from the view of optimization is still limited. This study opens the door to a deep understanding of LSR by initiating such an analysis. In this paper, we analyze the convergence behaviors of stochastic gradient descent with label smoothing regularization for solving non-convex problems and show that an appropriate LSR can help to speed up the convergence by reducing the variance. More interestingly, we propose a simple yet effective strategy, namely the Two-Stage LAbel smoothing algorithm (TSLA), that uses LSR in the early training epochs and drops it off in the later training epochs. We observe from the improved convergence result of TSLA that it benefits from LSR in the first stage and essentially converges faster in the second stage. To the best of our knowledge, this is the first work to understand the power of LSR by establishing the convergence complexity of stochastic methods with LSR in non-convex optimization. We empirically demonstrate the effectiveness of the proposed method in comparison with baselines on training ResNet models over benchmark data sets.
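
The two-stage schedule is straightforward to implement; a minimal sketch (the switch epoch and smoothing level are illustrative hyperparameters):

    import torch
    import torch.nn.functional as F

    def tsla_loss(logits, targets, epoch, switch_epoch=60, smoothing=0.1):
        """Stage 1 (epoch < switch_epoch): cross-entropy with label smoothing.
        Stage 2: plain cross-entropy."""
        eps = smoothing if epoch < switch_epoch else 0.0
        return F.cross_entropy(logits, targets, label_smoothing=eps)

    logits, targets = torch.randn(8, 10), torch.randint(0, 10, (8,))
    print(tsla_loss(logits, targets, epoch=10), tsla_loss(logits, targets, epoch=80))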

URL: https://openreview.net/forum?id=rPyxSRxSLJ

---

Title: Revisiting Active Learning in the Era of Vision Foundation Models

Abstract: Foundation vision or vision-language models are trained on large unlabeled or noisy data and learn robust representations that can achieve impressive zero- or few-shot performance on diverse tasks. Given these properties, they are a natural fit for \textit{active learning} (AL), which aims to maximize labeling efficiency. However, the full potential of foundation models has not been explored in the context of AL, specifically in the low-budget regime. In this work, we evaluate how foundation models influence three critical components of effective AL, namely, 1) initial labeled pool selection, 2) ensuring diverse sampling, and 3) the trade-off between representative and uncertainty sampling. We systematically study how the robust representations of foundation models (DINOv2, OpenCLIP) challenge existing findings in active learning. Our observations inform the principled construction of a new simple and elegant AL strategy that balances uncertainty estimated via dropout with sample diversity. We extensively test our strategy on many challenging image classification benchmarks, including natural images as well as out-of-domain biomedical images that are relatively understudied in the AL literature. Source code will be made available.

URL: https://openreview.net/forum?id=u8K83M9mbG

---

Title: Piecewise-Stationary Dueling Bandits

Abstract: We study the piecewise-stationary dueling bandits problem with $K$ arms, where the time horizon $T$ consists of $M$ stationary segments, each of which is associated with its own preference matrix.
The learner repeatedly selects a pair of arms and observes a binary preference between them as feedback.
To minimize the accumulated regret, the learner needs to pick the Condorcet winner of each stationary segment as often as possible, despite preference matrices and segment lengths being unknown.
We propose the Beat the Winner Reset algorithm and prove a bound on its expected binary weak regret in the stationary case, which tightens the bound of current state-of-the-art algorithms.
We also show a regret bound for the non-stationary case, without requiring knowledge of $M$ or $T$.
We further propose and analyze two meta-algorithms, DETECT for weak regret and Monitored Dueling Bandits for strong regret, both based on a detection-window approach that can incorporate any dueling bandit algorithm as a black-box algorithm.
Finally, we prove a worst-case lower bound for expected weak regret in the non-stationary case.

URL: https://openreview.net/forum?id=WhEHEDP7ZG

---

Title: Simultaneous Dimensionality Reduction: A Data Efficient Approach for Multimodal Representations Learning

Abstract: Current experiments frequently produce high-dimensional, multimodal datasets—such as those combining neural activity and animal behavior or gene expression and phenotypic profiling—with the goal of extracting useful correlations between the modalities. Often, the first step in analyzing such datasets is dimensionality reduction. We explore two primary classes of approaches to dimensionality reduction (DR): Independent Dimensionality Reduction (IDR) and Simultaneous Dimensionality Reduction (SDR). In IDR methods, of which Principal Components Analysis is a paradigmatic example, each modality is compressed independently, striving to retain as much variation within each modality as possible. In contrast, in SDR, one simultaneously compresses the modalities to maximize the covariation between the reduced descriptions while paying less attention to how much individual variation is preserved. Paradigmatic examples include Partial Least Squares and Canonical Correlations Analysis. Even though these DR methods are a staple of statistics, their relative accuracy and data set size requirements are poorly understood. We use a generative linear model to synthesize multimodal data with known variance and covariance structures to examine these questions. We assess the accuracy of the reconstruction of the covariance structures as a function of the number of samples, signal-to-noise ratio, and the number of varying and covarying signals in the data. Using numerical experiments, we demonstrate that linear SDR methods consistently outperform linear IDR methods and yield higher-quality, more succinct reduced-dimensional representations with smaller datasets. Remarkably, regularized CCA can identify low-dimensional weak covarying structures even when the number of samples is much smaller than the dimensionality of the data, which is a regime challenging for all dimensionality reduction methods. Our work corroborates and explains previous observations in the literature that SDR can be more effective in detecting covariation patterns in data. These findings strengthen the intuition that SDR should be preferred to IDR in real-world data analysis when detecting covariation is more important than preserving variation.
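
A tiny synthetic illustration of the two families in the same spirit (dimensions, noise levels and the use of scikit-learn's PCA/CCA are illustrative): each modality carries a large modality-specific direction plus a weaker shared signal, so independent compression tends to keep the former while joint compression targets the latter:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.cross_decomposition import CCA

    rng = np.random.default_rng(0)
    n, dx, dy = 100, 30, 30
    shared = rng.normal(size=(n, 1))                   # covarying signal
    priv_x, priv_y = rng.normal(size=(n, 1)), rng.normal(size=(n, 1))
    X = shared @ rng.normal(size=(1, dx)) + 3.0 * priv_x @ rng.normal(size=(1, dx)) + rng.normal(size=(n, dx))
    Y = shared @ rng.normal(size=(1, dy)) + 3.0 * priv_y @ rng.normal(size=(1, dy)) + rng.normal(size=(n, dy))

    # IDR: compress each modality independently, then correlate the compressed signals
    x_pc = PCA(n_components=1).fit_transform(X).ravel()
    y_pc = PCA(n_components=1).fit_transform(Y).ravel()

    # SDR: compress the two modalities jointly
    x_c, y_c = CCA(n_components=1).fit_transform(X, Y)

    print("IDR corr:", abs(np.corrcoef(x_pc, y_pc)[0, 1]),
          "SDR corr:", abs(np.corrcoef(x_c.ravel(), y_c.ravel())[0, 1]))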

URL: https://openreview.net/forum?id=Ni14fXbyTV

---

Title: Towards Prototype Conformity Loss Functions for Better Outlier Detection in Traffic Sign Image Classification

Abstract: Deep neural networks~(DNNs) generate overconfident outputs even in cases of mis-detections caused by abnormal data. Consequently, this can lead to unreliable classifications and, thus, to potential issues in safety-critical applications such as automated driving systems. Recent works propose to detect such anomalous data based on probabilistic methods derived from the DNN's internal activation functions, such as the convolutional neural networks (CNN) backbones. This paper shows that such CNNs cannot semantically disentangle similar classes when trained with conventional cross-entropy loss functions, leading to poor out-of-distribution (OOD) detection while applying probabilistic methods for such a purpose. Therefore, we propose to apply the prototype conformity loss (PCL) function from the literature and show that such a contrastive learning method leads to better OOD detection for traffic sign classification. Furthermore, we propose two novel variations of the PCL, namely weighted PCL (WPCL) and multi-scale PCL (MSPCL), which group similar classes and force the DNN to disentangle them from each other. In contrast to existing contrastive OOD detection literature, we do not rely on complex input transformations or augmentations. We perform our experiments on multiple DNNs and two traffic sign classification datasets, which we test against multiple OOD data sources, such as adversarial and non-adversarial augmentation and real-world OOD data. Based on that, we demonstrate that our PCL variations can achieve superior results in OOD detection when the training dataset includes various similar classes.

URL: https://openreview.net/forum?id=urFvixn7uj

---

Title: Unifying the Perspectives of NLP and Software Engineering: A Survey on Language Models for Code

Abstract: In this work we systematically review the recent advancements in code processing with language models, covering 50+ models, 30+ evaluation tasks, 170+ datasets, and 700+ related works. We break down code processing models into general language models represented by the GPT family and specialized models that are specifically pretrained on code, often with tailored objectives. We discuss the relations and differences between these models, and highlight the historical transition of code modeling from statistical models and RNNs to pretrained Transformers and LLMs, which is exactly the same course that had been taken by NLP. We also discuss code-specific features such as AST, CFG, and unit tests, along with their application in training code language models, and identify key challenges and potential future directions in this domain.

URL: https://openreview.net/forum?id=hkNnGqZnpa

---

Title: Contrastive Graph Autoencoder for Shape-based Polygon Retrieval from Large Geometry Datasets

Abstract: Retrieval of polygon geometries with similar shapes from maps is a challenging geographic information task. Existing approaches cannot process polygon geometries with complex shapes or (multiple) holes, and they are sensitive to geometric transformations (e.g., rotation). We propose Contrastive Graph Autoencoder (CGAE), a robust and effective graph representation autoencoder for extracting polygon geometries of similar shapes from real-world building maps based on template queries. By leveraging graph message-passing layers, graph feature augmentation and contrastive learning, the proposed CGAE embeds highly discriminative latent embeddings by reconstructing graph features w.r.t. the graph representations of input polygons, outperforming existing graph-based autoencoders (GAEs) in geometry retrieval of similar polygons. Experimentally, we demonstrate this capability based on template query shapes on real-world datasets and show its high robustness to geometric transformations in contrast to existing GAEs, indicating the strong generalizability and versatility of CGAE, including on complex real-world building footprints.

URL: https://openreview.net/forum?id=9fcZNAmnyh

---

Title: MoMA: Model-based Mirror Ascent for Offline Reinforcement Learning

Abstract: Model-based offline reinforcement learning methods (RL) have achieved state-of-the-art performance in many decision-making problems thanks to their sample efficiency and generalizability. Despite these advancements, existing model-based offline RL approaches either focus on theoretical studies without developing practical algorithms or rely on a restricted parametric policy space, thus not fully leveraging the advantages of an unrestricted policy space inherent to model-based methods. To address this limitation, we develop MoMA, a model-based mirror ascent algorithm with general function approximations under partial coverage of offline data. MoMA distinguishes itself from existing literature by employing an unrestricted policy class. In each iteration, MoMA conservatively estimates the value function by a minimization procedure within a confidence set of transition models in the policy evaluation step, then updates the policy with general function approximations instead of commonly-used parametric policy classes in the policy improvement step. Under some mild assumptions, we establish theoretical guarantees of MoMA by proving an upper bound on the suboptimality of the returned policy.
We also provide a practically implementable, approximate version of the algorithm. The effectiveness of MoMA is demonstrated via numerical studies.

URL: https://openreview.net/forum?id=RHUKg8n9tw

---

Title: A Survey on Fairness Without Demographics

Abstract: The issue of bias in Machine Learning (ML) models is a significant challenge for the machine learning community. Real-world biases can be embedded in the data used to train models, and prior studies have shown that ML models can learn and even amplify these biases. This can result in unfair treatment of individuals based on their inherent characteristics or sensitive attributes such as gender, race, or age. With the increasing use of ML models in high-stakes scenarios, ensuring fairness is crucial and has gained significant attention from researchers in recent years. However, the challenge of ensuring fairness becomes much greater when the assumption of full access to sensitive attributes does not hold. The settings where this assumption does not hold include cases where (1) only limited or noisy demographic information is available, or (2) demographic information is entirely unobserved due to privacy restrictions. In this survey, we review recent research efforts aimed at ensuring fairness when sensitive attributes are missing. We propose a taxonomy of existing works, and more importantly, highlight current challenges and future research directions to stimulate research in ML fairness in the setting of missing sensitive attributes.

URL: https://openreview.net/forum?id=3HE4vPNIfX

---

Title: Improving Predictor Reliability with Selective Recalibration

Abstract: A reliable deep learning system should be able to accurately express its confidence with respect to its predictions, a quality known as calibration. One of the most effective ways to produce reliable confidence estimates with a pre-trained model is by applying a post-hoc recalibration method. Popular recalibration methods like temperature scaling are typically fit on a small amount of data and work in the model’s output space, as opposed to the more expressive feature embedding space, and thus usually have only one or a handful of parameters. However, the target distribution to which they are applied is often complex and difficult to fit well with such a function. To this end we propose selective recalibration, where a selection model learns to reject some user-chosen proportion of the data in order to allow the recalibrator to focus on regions of the input space that can be well-captured by such a model. We provide theoretical analysis to motivate our algorithm, and test our method through comprehensive experiments on difficult medical imaging and zero-shot classification tasks. Our results show that selective recalibration consistently leads to significantly lower calibration error than a wide range of selection and recalibration baselines.
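
A stripped-down sketch of the two ingredients (selection, then recalibration on the retained subset); here selection is by a simple predictive-entropy heuristic and recalibration is temperature scaling, which simplifies the jointly learned selector described above:

    import numpy as np
    from scipy.optimize import minimize_scalar

    def nll(temperature, logits, labels):
        z = logits / temperature
        z = z - z.max(axis=1, keepdims=True)           # stable log-softmax
        logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -logp[np.arange(len(labels)), labels].mean()

    def selective_recalibrate(logits, labels, reject_frac=0.2):
        probs = np.exp(logits - logits.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)
        entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
        keep = entropy <= np.quantile(entropy, 1 - reject_frac)   # reject hardest examples
        res = minimize_scalar(nll, bounds=(0.05, 20.0), method="bounded",
                              args=(logits[keep], labels[keep]))
        return res.x, keep

    rng = np.random.default_rng(0)
    logits, labels = 3.0 * rng.normal(size=(500, 10)), rng.integers(0, 10, size=500)
    t, keep = selective_recalibrate(logits, labels)
    print(f"fitted temperature {t:.2f} on {keep.mean():.0%} of the examples")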

URL: https://openreview.net/forum?id=Aoj9H6jl6F

---

Title: Fair Representation in Submodular Subset Selection: A Pareto Optimization Approach

Abstract: Many machine learning applications, such as feature selection, recommendation, and social advertising, require the joint optimization of the global utility and the representativeness for different groups of items or users. To meet such requirements, we propose a novel multi-objective combinatorial optimization problem called Submodular Maximization with Fair Representation (SMFR), which selects subsets from a ground set, subject to a knapsack or matroid constraint, to maximize a submodular (utility) function $f$ as well as a set of $d$ submodular (representativeness) functions $g_1, \dots, g_d$. We show that the maximization of $f$ might conflict with the maximization of $g_1, \dots, g_d$, so that no single solution can optimize all the objectives at the same time. Therefore, we propose a Pareto optimization approach to SMFR, which finds a set of solutions to approximate all Pareto-optimal solutions with different trade-offs between the objectives. Our method converts an instance of SMFR into several submodular cover instances by adjusting the weights of the objective functions; then it computes a set of solutions by running the greedy algorithm on each submodular cover instance. We prove that our method provides approximation guarantees for SMFR under knapsack or matroid constraints. Finally, we demonstrate the effectiveness of SMFR and our proposed approach in two real-world problems: maximum coverage and recommendation.

URL: https://openreview.net/forum?id=0Hm01Vc8zT

---

Title: Dual-windowed Vision Transformer with Angular Self-Attention

Abstract: Following the great success in natural language processing, transformer-based models have emerged as the competitive model against the convolutional neural networks in computer vision. Vision transformer (ViT) and its subsequent variants have exhibited promising performance in tasks such as image classification, object detection and semantic segmentation. The core of vision transformers is the self-attention mechanism, which models the long-range dependency of different tokens. Conventionally, the attention matrix in self-attention is calculated by the scaled dot-product of \textit{query} (Q) and \textit{key} (K). In this case, the attention weight would depend on norm of Q and K as well as the angle between them. In this paper, we propose a new attention mechanism named angular self-attention, which replaces the scaled dot-product operation with the angular function in order to effectively model the relationship between tokens. In particular, we propose two forms of functions: quadratic and cosine functions, for our angular self-attention. Based on angular self-attention, we design a new vision transformer architecture called dual-windowed angular vision transformer (\textbf{DWAViT}). DWAViT is a hierarchical-structured model characterized by the angular self-attention and a new local window mechanism. We evaluate DWAViT on multiple computer vision benchmarks, including image classification on ImageNet-1K, object detection on COCO, and semantic segmentation on ADE20K. Our experimental results also suggest that our model can achieve promising performance on the tasks while maintaining comparable computational cost with that of the baseline models (e.g., Swin Transformer).
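
As a hedged illustration only (the paper's exact quadratic and cosine forms are not reproduced here), attention logits can be made to depend on the angle between each query and key rather than on their scaled dot product:

    import torch
    import torch.nn.functional as F

    def angular_attention(q, k, v, temperature=0.1, eps=1e-8):
        """Toy single-head attention whose logits depend only on the angle
        between query and key (via its cosine), not on their norms."""
        q_hat = q / (q.norm(dim=-1, keepdim=True) + eps)
        k_hat = k / (k.norm(dim=-1, keepdim=True) + eps)
        cos = q_hat @ k_hat.transpose(-2, -1)          # cos(angle), in [-1, 1]
        return F.softmax(cos / temperature, dim=-1) @ v

    x = torch.randn(4, 16, 32)                         # (batch, tokens, dim)
    print(angular_attention(x, x, x).shape)            # torch.Size([4, 16, 32])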

URL: https://openreview.net/forum?id=7jgu4oXsGM

---
