Weekly TMLR digest for Jul 14, 2024

TMLR

Jul 14, 2024, 12:00:11 AM
to tmlr-annou...@googlegroups.com


New certifications
==================

Survey Certification: A Survey on Data Selection for Language Models

Alon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang, Niklas Muennighoff, Bairu Hou, Liangming Pan, Haewon Jeong, Colin Raffel, Shiyu Chang, Tatsunori Hashimoto, William Yang Wang

https://openreview.net/forum?id=XfHWcNTSHp

---


Reproducibility Certification: Reproducibility Study: Equal Improvability: A New Fairness Notion Considering the Long-Term Impact

Berkay Chakar, Amina Izbassar, Mina Janićijević, Jakub Tomaszewski

https://openreview.net/forum?id=Yj8fUQGXXL

---


Expert Certification: Incorporating Unlabelled Data into Bayesian Neural Networks

Mrinank Sharma, Tom Rainforth, Yee Whye Teh, Vincent Fortuin

https://openreview.net/forum?id=q2AbLOwmHm

---


Expert Certification: CompoDiff: Versatile Composed Image Retrieval With Latent Diffusion

Geonmo Gu, Sanghyuk Chun, Wonjae Kim, HeeJae Jun, Yoohoon Kang, Sangdoo Yun

https://openreview.net/forum?id=mKtlzW0bWc

---


Featured Certification: Grid Cell-Inspired Fragmentation and Recall for Efficient Map Building

Jaedong Hwang, Zhang-Wei Hong, Eric R Chen, Akhilan Boopathy, Pulkit Agrawal, Ila R Fiete

https://openreview.net/forum?id=cT8oOJ6Q6F

---


Accepted papers
===============


Title: A Survey on Data Selection for Language Models

Authors: Alon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang, Niklas Muennighoff, Bairu Hou, Liangming Pan, Haewon Jeong, Colin Raffel, Shiyu Chang, Tatsunori Hashimoto, William Yang Wang

Abstract: A major factor in the recent success of large language models is the use of enormous and ever-growing text datasets for unsupervised pre-training. However, naively training a model on all available data may not be optimal (or feasible), as the quality of available text data can vary. Filtering out data can also decrease the carbon footprint and financial costs of training models by reducing the amount of training required.

Data selection methods aim to determine which candidate data points to include in the training dataset and how to appropriately sample from the selected data points. The promise of improved data selection methods has caused the volume of research in the area to rapidly expand. However, because deep learning is mostly driven by empirical evidence and experimentation on large-scale data is expensive, few organizations have the resources for extensive data selection research. Consequently, knowledge of effective data selection practices has become concentrated within a few organizations, many of which do not openly share their findings and methodologies.

To narrow this gap in knowledge, we present a comprehensive review of existing literature on data selection methods and related research areas, providing a taxonomy of existing approaches. By describing the current landscape of research, this work aims to accelerate progress in data selection by establishing an entry point for new and established researchers. Additionally, throughout this review we draw attention to noticeable holes in the literature and conclude the paper by proposing promising avenues for future research.

URL: https://openreview.net/forum?id=XfHWcNTSHp

---

Title: Read Between the Layers: Leveraging Multi-Layer Representations for Rehearsal-Free Continual Learning with Pre-Trained Models

Authors: Kyra Ahrens, Hans Hergen Lehmann, Jae Hee Lee, Stefan Wermter

Abstract: We address the Continual Learning (CL) problem, wherein a model must learn a sequence of tasks from non-stationary distributions while preserving prior knowledge upon encountering new experiences. With the advancement of foundation models, CL research has pivoted from the initial learning-from-scratch paradigm towards utilizing generic features from large-scale pre-training. However, existing approaches to CL with pre-trained models primarily focus on separating class-specific features from the final representation layer and neglect the potential of intermediate representations to capture low- and mid-level features, which are more invariant to domain shifts. In this work, we propose LayUP, a new prototype-based approach to CL that leverages second-order feature statistics from multiple intermediate layers of a pre-trained network. Our method is conceptually simple, does not require access to prior data, and works out of the box with any foundation model. LayUP surpasses the state of the art in four of the seven class-incremental learning benchmarks, all three domain-incremental learning benchmarks and in six of the seven online continual learning benchmarks, while significantly reducing memory and computational requirements compared to existing baselines. Our results demonstrate that fully exhausting the representational capacities of pre-trained models in CL goes well beyond their final embeddings.
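
Since the abstract only names the ingredients, a rough, hedged sketch of the general family LayUP belongs to may help orient readers: prototype-based classification over concatenated multi-layer features using second-order feature statistics (a ridge-regularized Gram matrix). This is not the paper's method; all data and sizes below are synthetic.

```python
# Hedged sketch (not LayUP itself) of prototype classification with second-order
# feature statistics over concatenated multi-layer features.
import numpy as np

def fit_prototypes(features, labels, ridge=1.0):
    # features: (N, D) concatenated features from several layers; labels: (N,) class ids
    gram = features.T @ features + ridge * np.eye(features.shape[1])
    gram_inv = np.linalg.inv(gram)
    classes = np.unique(labels)
    prototypes = np.stack([features[labels == c].mean(axis=0) for c in classes])
    return gram_inv, prototypes, classes

def predict(features, gram_inv, prototypes, classes):
    scores = features @ gram_inv @ prototypes.T     # similarity in the "whitened" space
    return classes[scores.argmax(axis=1)]

rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0.0, 1.0, (50, 64)), rng.normal(1.0, 1.0, (50, 64))])
labels = np.repeat([0, 1], 50)
gram_inv, protos, classes = fit_prototypes(feats, labels)
print("train accuracy:", (predict(feats, gram_inv, protos, classes) == labels).mean())
```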

URL: https://openreview.net/forum?id=ZTcxp9xYr2

---

Title: Reproducibility Study: Equal Improvability: A New Fairness Notion Considering the Long-Term Impact

Authors: Berkay Chakar, Amina Izbassar, Mina Janićijević, Jakub Tomaszewski

Abstract: This reproducibility study aims to evaluate the robustness of Equal Improvability (EI) - an effort-based framework for ensuring long-term fairness. To this end, we seek to analyze the three proposed EI-ensuring regularization techniques, i.e. Covariance-based, KDE-based, and Loss-based EI. Our findings largely substantiate the initial assertions, demonstrating EI’s enhanced performance over Empirical Risk Minimization (ERM) techniques on various test datasets. Furthermore, while affirming the long-term effectiveness in fairness, the study also uncovers challenges in resilience to overfitting, particularly in highly complex models.
Building upon the original study, the experiments were extended to include a new dataset and multiple sensitive attributes. These additional tests further demonstrated the effectiveness of the EI approach, reinforcing its continued success. Our study highlights the importance of adaptable strategies in AI fairness, contributing to the ongoing discourse in this field of research.

URL: https://openreview.net/forum?id=Yj8fUQGXXL

---

Title: SeqLink: A Robust Neural-ODE Architecture for Modelling Partially Observed Time Series

Authors: Futoon M. Abushaqra, Hao Xue, Yongli Ren, Flora D. Salim

Abstract: Ordinary Differential Equations (ODEs) based models have become popular as foundation models for solving many time series problems. Combining neural ODEs with traditional RNN models has provided the best representation for irregular time series. However, ODEs-based models typically require the trajectory of hidden states to be defined based on either the initial observed value or the most recent observation, raising questions about their effectiveness when dealing with longer sequences and extended time intervals. In this article, we explore the behaviour of the ODEs-based models in the context of time series data with varying degrees of sparsity. We introduce SeqLink, an innovative neural architecture designed to enhance the robustness of sequence representation. Unlike traditional approaches that solely rely on the hidden state generated from the last observed value, SeqLink leverages ODE latent representations derived from multiple data samples, enabling it to generate robust data representations regardless of sequence length or data sparsity level. The core concept behind our model is the definition of hidden states for the unobserved values based on the relationships between samples (links between sequences). Through extensive experiments on partially observed synthetic and real-world datasets, we demonstrate that SeqLink improves the modelling of intermittent time series, consistently outperforming state-of-the-art approaches.

URL: https://openreview.net/forum?id=WCUT6leXKf

---

Title: Incorporating Unlabelled Data into Bayesian Neural Networks

Authors: Mrinank Sharma, Tom Rainforth, Yee Whye Teh, Vincent Fortuin

Abstract: Conventional Bayesian Neural Networks (BNNs) are unable to leverage unlabelled data to improve their predictions. To overcome this limitation, we introduce Self-Supervised Bayesian Neural Networks, which use unlabelled data to learn models with suitable prior predictive distributions. This is achieved by leveraging contrastive pretraining techniques and optimising a variational lower bound. We then show that the prior predictive distributions of self-supervised BNNs capture problem semantics better than conventional BNN priors. In turn, our approach offers improved predictive performance over conventional BNNs, especially in low-budget regimes.

URL: https://openreview.net/forum?id=q2AbLOwmHm

---

Title: XPL: A Cross-Model framework for Semi-Supervised Prompt Learning in Vision-Language Models

Authors: Omprakash Chakraborty, Aadarsh Sahoo, Rameswar Panda, Abir Das

Abstract: Prompt learning, which focuses on learning soft prompts, has emerged as a promising approach for efficiently adapting pretrained vision-language models (VLMs) to multiple downstream tasks. While prior works have shown promising performance on common benchmarks, they typically rely on labeled data samples only. This largely forgoes the information gain from the vast collection of otherwise unlabeled samples available in the wild. To mitigate this, we propose a simple yet efficient cross-model framework to leverage the unlabeled samples, achieving significant gains in model performance. Specifically, we employ a semi-supervised prompt learning approach which makes the learned prompts invariant to the different views of a given unlabeled sample. The multiple views are obtained using different augmentations on the images as well as by varying the lengths of visual and text prompts attached to these samples. Experimenting with this simple yet surprisingly effective approach over a large number of benchmark datasets, we observe a considerable improvement in the quality of soft prompts, thereby making an immense gain in image classification performance. Interestingly, our approach also benefits from out-of-domain unlabeled images, highlighting its robustness and generalization capabilities.

URL: https://openreview.net/forum?id=oxAZv3QD6M

---

Title: Revisiting Active Learning in the Era of Vision Foundation Models

Authors: Sanket Rajan Gupte, Josiah Aklilu, Jeffrey J Nirschl, Serena Yeung-Levy

Abstract: Foundation vision or vision-language models are trained on large unlabeled or noisy data and learn robust representations that can achieve impressive zero- or few-shot performance on diverse tasks. Given these properties, they are a natural fit for _active learning_ (AL), which aims to maximize labeling efficiency. However, the full potential of foundation models has not been explored in the context of AL, specifically in the low-budget regime. In this work, we evaluate how foundation models influence three critical components of effective AL, namely, 1) initial labeled pool selection, 2) ensuring diverse sampling, and 3) the trade-off between representative and uncertainty sampling. We systematically study how the robust representations of foundation models (DINOv2, OpenCLIP) challenge existing findings in active learning. Our observations inform the principled construction of a new simple and elegant AL strategy that balances uncertainty estimated via dropout with sample diversity. We extensively test our strategy on many challenging image classification benchmarks, including natural images as well as out-of-domain biomedical images that are relatively understudied in the AL literature. We also provide a highly performant and efficient implementation of modern AL strategies (including our method) at https://github.com/sanketx/AL-foundation-models.

URL: https://openreview.net/forum?id=u8K83M9mbG

---

Title: CompoDiff: Versatile Composed Image Retrieval With Latent Diffusion

Authors: Geonmo Gu, Sanghyuk Chun, Wonjae Kim, HeeJae Jun, Yoohoon Kang, Sangdoo Yun

Abstract: This paper proposes a novel diffusion-based model, CompoDiff, for solving zero-shot Composed Image Retrieval (ZS-CIR) with latent diffusion. This paper also introduces a new synthetic dataset, named SynthTriplets18M, with 18.8 million reference images, conditions, and corresponding target image triplets to train CIR models. CompoDiff and SynthTriplets18M tackle the shortages of the previous CIR approaches, such as poor generalizability due to the small dataset scale and the limited types of conditions. CompoDiff not only achieves a new state-of-the-art on four ZS-CIR benchmarks, including FashionIQ, CIRR, CIRCO, and GeneCIS, but also enables a more versatile and controllable CIR by accepting various conditions, such as negative text, and image mask conditions. CompoDiff also shows the controllability of the condition strength between text and image queries and the trade-off between inference speed and performance, which are unavailable with existing CIR methods. The code and dataset samples are available at https://github.com/navervision/CompoDiff.

URL: https://openreview.net/forum?id=mKtlzW0bWc

---

Title: Homogenizing Non-IID Datasets via In-Distribution Knowledge Distillation for Decentralized Learning

Authors: Deepak Ravikumar, Gobinda Saha, Sai Aparna Aketi, Kaushik Roy

Abstract: Decentralized learning enables serverless training of deep neural networks (DNNs) in a distributed manner on multiple nodes. One of the key challenges with decentralized learning is heterogeneity in the data distribution across the nodes. Data heterogeneity results in slow and unstable global convergence and therefore poor generalization performance. In this paper, we propose In-Distribution Knowledge Distillation (IDKD) to address the challenge of heterogeneous data distribution. The goal of IDKD is to homogenize the data distribution across the nodes. While such data homogenization can be achieved by exchanging data among the nodes sacrificing privacy, IDKD achieves the same objective using a common public dataset across nodes without breaking the privacy constraint. This public dataset is different from the training dataset and is used to distill the knowledge from each node and communicate it to its neighbors through the generated labels. With traditional knowledge distillation, the generalization of the distilled model is reduced due to misalignment between the private and public data distribution. Thus, we introduce an Out-of-Distribution (OoD) detector at each node to label a subset of the public dataset that maps close to the local training data distribution. Our experiments on multiple image classification datasets and graph topologies show that the proposed IDKD scheme is more effective than traditional knowledge distillation and achieves state-of-the-art generalization performance on heterogeneously distributed data with minimal communication overhead.

URL: https://openreview.net/forum?id=CuyJkNjIVd

---

Title: Combine and Conquer: A Meta-Analysis on Data Shift and Out-of-Distribution Detection

Authors: Eduardo Dadalto Câmara Gomes, Florence Alberge, Pierre Duhamel, Pablo Piantanida

Abstract: This paper introduces a universal approach to seamlessly combine out-of-distribution (OOD) detection scores. These scores encompass a wide range of techniques that leverage the self-confidence of deep learning models and the anomalous behavior of features in the latent space. Not surprisingly, combining such a varied population using simple statistics proves inadequate. To overcome this challenge, we propose a quantile normalization to map these scores into p-values, effectively framing the problem into a multi-variate hypothesis test. Then, we combine these tests using established meta-analysis tools, resulting in a more effective detector with consolidated decision boundaries. Furthermore, we create a probabilistic interpretable criterion by mapping the final statistics into a distribution with known parameters. Through empirical investigation, we explore different types of shifts, each exerting varying degrees of impact on data. Our results demonstrate that our approach significantly improves overall robustness and performance across diverse OOD detection scenarios. Notably, our framework is easily extensible for future developments in detection scores and stands as the first to combine decision boundaries in this context. The code and artifacts associated with this work are publicly available\footnote{\url{https://github.com/edadaltocg/detectors}}.
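
The recipe sketched above (per-detector scores mapped to p-values via in-distribution quantiles, then fused with a meta-analysis rule) can be illustrated with a minimal, hedged example using Fisher's method. The calibration data and scores below are synthetic; the authors' actual implementation is in the linked repository.

```python
# Illustrative sketch: turn two heterogeneous OOD scores into p-values via
# empirical in-distribution quantiles, then combine them with Fisher's method.
import numpy as np
from scipy.stats import combine_pvalues

def empirical_p_value(calibration_scores, score):
    """Right-tail empirical p-value: fraction of in-distribution calibration scores
    at least as extreme, with a +1 correction so p is never exactly zero.
    Assumes larger scores mean 'more OOD'."""
    calibration_scores = np.asarray(calibration_scores)
    return (np.sum(calibration_scores >= score) + 1) / (len(calibration_scores) + 1)

rng = np.random.default_rng(0)
# hypothetical in-distribution calibration scores from two different detectors
calib_a, calib_b = rng.normal(0, 1, 5000), rng.gamma(2.0, 1.0, 5000)
# hypothetical scores of one test input under each detector
p_values = [empirical_p_value(calib_a, 2.9), empirical_p_value(calib_b, 7.5)]

statistic, p_combined = combine_pvalues(p_values, method="fisher")
print(p_values, p_combined)   # a small combined p-value -> flag the input as OOD
```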

URL: https://openreview.net/forum?id=VGNBUS9TrU

---

Title: Directed Graph Transformers

Authors: Qitong Wang, Georgios Kollias, Vasileios Kalantzis, Naoki Abe, Mohammed J Zaki

Abstract: In this paper, we address the problem of capturing graph directionality using transformers. Most existing graph transformers typically capture distances between graph nodes and do not take edge direction into account. This is a limiting assumption since many graph applications need to exploit sophisticated relationships in graph data, such as time, causality, or generic dependency constraints. We introduce a novel graph transformer architecture that explicitly takes into account the directionality between connected graph nodes. To achieve this, we make use of dual encodings to represent both potential roles, i.e., source or target, of each pair of vertices linked by a directed edge. These encodings are learned by leveraging the latent adjacency information extracted from a directional attention module, localized with $k$-hop neighborhood information. Extensive experiments on synthetic and real graph datasets show that our approach can have significant accuracy gains over previous graph transformer (GT) and graph neural network (GNN) approaches, providing state-of-the-art (SOTA) results on inherently directed graphs.

URL: https://openreview.net/forum?id=otTFPjziiK

---

Title: Contextual Policies Enable Efficient and Interpretable Inverse Reinforcement Learning for Populations

Authors: Ville Tanskanen, Chang Rajani, Perttu Hämäläinen, Christian Guckelsberger, Arto Klami

Abstract: Inverse reinforcement learning (IRL) methods learn a reward function from expert demonstrations such as human behavior, offering a practical solution for crafting reward functions for complex environments. However, IRL is computationally expensive when applied to large populations of demonstrators, as existing IRL algorithms require solving a separate reinforcement learning (RL) problem for each individual. We propose a new IRL approach that relies on contextual RL, where an optimal policy is learned for multiple contexts.
We first learn a contextual policy that provides the RL solution directly for a parametric family of reward functions, and then re-use it for IRL on each individual within the population. We motivate our method within the scenario of AI-driven playtesting of videogames, and focus on an interpretable family of reward functions. We evaluate the method on a navigation task and the battle arena game Derk, where it successfully recovers distinct player reward preferences from a simulated population and provides substantial time savings compared to a solid baseline of adversarial IRL.

URL: https://openreview.net/forum?id=4CUkCG6ITe

---

Title: NuTime: Numerically Multi-Scaled Embedding for Large-Scale Time-Series Pretraining

Authors: Chenguo Lin, Xumeng Wen, Wei Cao, Congrui Huang, Jiang Bian, Stephen Lin, Zhirong Wu

Abstract: Recent research on time-series self-supervised models shows great promise in learning semantic representations. However, it has been limited to small-scale datasets, e.g., thousands of temporal sequences. In this work, we make key technical contributions that are tailored to the numerical properties of time-series data and allow the model to scale to large datasets, e.g., millions of temporal sequences. We adopt the Transformer architecture by first partitioning the input into non-overlapping windows. Each window is then characterized by its normalized shape and two scalar values denoting the mean and standard deviation within each window. To embed scalar values that may possess arbitrary numerical amplitudes to high-dimensional vectors, we propose a numerically multi-scaled embedding module enumerating all possible numerical scales for the scalars. The model undergoes pretraining with a simple contrastive objective on a large-scale dataset over a million sequences collected by merging existing public data. We study its transfer performance on a number of univariate and multivariate classification tasks, few-shot learning, unsupervised clustering and anomaly detection benchmarks. Our method exhibits remarkable improvement against previous pretraining approaches and establishes the new state of the art, even compared with domain-specific non-learning-based methods.
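
The windowing step described above is easy to picture; here is a small, illustrative sketch (not the released code) that reduces each non-overlapping window to its z-normalized shape plus two scalars, the window mean and standard deviation.

```python
# Illustrative only: per-window (shape, mean, std) tokens for a 1-D series.
import numpy as np

def window_tokens(series, window):
    """Split a 1-D series into non-overlapping windows and return, per window,
    the z-normalized shape plus the mean and standard deviation scalars."""
    n = (len(series) // window) * window
    windows = series[:n].reshape(-1, window)
    mean = windows.mean(axis=1, keepdims=True)
    std = windows.std(axis=1, keepdims=True) + 1e-8     # avoid division by zero
    shape = (windows - mean) / std
    return shape, mean.squeeze(1), std.squeeze(1)

x = 100.0 + 50.0 * np.sin(np.linspace(0.0, 20.0, 640))  # toy series with large offset/scale
shape, mu, sigma = window_tokens(x, window=32)
print(shape.shape, mu.shape, sigma.shape)               # (20, 32) (20,) (20,)
```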

URL: https://openreview.net/forum?id=TwiSBZ0p9u

---

Title: Intriguing Properties of Hyperbolic Embeddings in Vision-Language Models

Authors: Sarah Ibrahimi, Mina Ghadimi Atigh, Nanne Van Noord, Pascal Mettes, Marcel Worring

Abstract: Vision-language models have in short time been established as powerful networks, demonstrating strong performance on a wide range of downstream tasks. A key factor behind their success is the learning of a joint embedding space where pairs of images and textual descriptions are contrastively aligned. Recent work has explored the geometry of the joint embedding space, finding that hyperbolic embeddings provide a compelling alternative to the commonly used Euclidean embeddings. Specifically, hyperbolic embeddings yield improved zero-shot generalization, better visual recognition, and more consistent semantic interpretations. In this paper, we conduct a deeper study into the hyperbolic embeddings and find that they open new doors for vision-language models. In particular, we find that hyperbolic vision-language models provide spatial awareness that Euclidean vision-language models lack, are better capable of dealing with ambiguity, and effectively discriminate between distributions. Our findings shed light on the greater potential of hyperbolic embeddings in large-scale settings, reaching beyond conventional down-stream tasks. Our code is available at https://github.com/saibr/hypvl

URL: https://openreview.net/forum?id=P5D2gfi4Gg

---

Title: Lyra: Orchestrating Dual Correction in Automated Theorem Proving

Authors: Chuanyang Zheng, Haiming Wang, Enze Xie, Zhengying Liu, Jiankai Sun, Huajian Xin, Jianhao Shen, Zhenguo Li, Yu Li

Abstract: Large Language Models (LLMs) present an intriguing avenue for exploration in the field of formal theorem proving. Nevertheless, their full potential, particularly concerning the mitigation of hallucinations and refinement through prover error messages, remains an area that has yet to be thoroughly investigated. To enhance the effectiveness of LLMs in the field, we introduce Lyra, a new framework that employs two distinct correction mechanisms: Tool Correction (TC) and Conjecture Correction (CC). To implement Tool Correction in the post-processing of formal proofs, we leverage prior knowledge to utilize predefined prover tools (e.g., Sledgehammer) for guiding the replacement of incorrect tools. Tool Correction significantly contributes to mitigating hallucinations, thereby improving the overall accuracy of the proof. In addition, we introduce Conjecture Correction, an error feedback mechanism designed to interact with the prover to refine formal proof conjectures using prover error messages. Compared to the previous refinement framework, the proposed Conjecture Correction refines generation with instruction but does not collect paired (generation, error & refinement) prompts. Our method has achieved state-of-the-art (SOTA) performance on both miniF2F validation (48.0% → 55.3%) and test (45.5% → 51.2%). We also present 3 IMO problems solved by Lyra. We believe Tool Correction (post-process for hallucination mitigation) and Conjecture Correction (subgoal adjustment from interaction with the environment) could provide a promising avenue for future research in this field.

URL: https://openreview.net/forum?id=Svt75kotzs

---

Title: Understanding the Role of Invariance in Transfer Learning

Authors: Till Speicher, Vedant Nanda, Krishna P. Gummadi

Abstract: Transfer learning is a powerful technique for knowledge-sharing between different tasks. Recent work has found that the representations of models with certain invariances, such as to adversarial input perturbations, achieve higher performance on downstream tasks. These findings suggest that invariance may be an important property in the context of transfer learning. However, the relationship of invariance with transfer performance is not fully understood yet and a number of questions remain. For instance, how important is invariance compared to other factors of the pretraining task? How transferable is learned invariance? In this work, we systematically investigate the importance of representational invariance for transfer learning, as well as how it interacts with other parameters during pretraining. To do so, we introduce a family of synthetic datasets that allow us to precisely control factors of variation both in training and test data. Using these datasets, we a) show that for learning representations with high transfer performance, invariance to the right transformations is as, or often more, important than most other factors such as the number of training samples, the model architecture and the identity of the pretraining classes, b) show conditions under which invariance can harm the ability to transfer representations and c) explore how transferable invariance is between tasks.
The code is available [here](https://github.com/tillspeicher/representation-invariance-transfer).

URL: https://openreview.net/forum?id=spJI4LSPIU

---

Title: Jigsaw Game: Federated Clustering

Authors: JINXUAN XU, Hong-You Chen, Wei-Lun Chao, Yuqian Zhang

Abstract: Federated learning has recently garnered significant attention, especially within the domain of supervised learning. However, despite the abundance of unlabeled data on end-users, unsupervised learning problems such as clustering in the federated setting remain underexplored. In this paper, we investigate the federated clustering problem, with a focus on federated k-means. We outline the challenge posed by its non-convex objective and data heterogeneity in the federated framework. To tackle these challenges, we adopt a new perspective by studying the structures of local solutions in k-means and propose a one-shot algorithm called FeCA (Federated Centroid Aggregation). FeCA adaptively refines local solutions on clients, then aggregates these refined client solutions to recover the global solution of the entire dataset in a single round. We empirically demonstrate the robustness of FeCA under various federated scenarios on both synthetic and real-world data. Additionally, we extend FeCA to representation learning and present DeepFeCA, which combines DeepCluster and FeCA for unsupervised feature learning in the federated setting.
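
FeCA's adaptive refinement of local solutions is not reproduced here, but the one-shot aggregation pattern it builds on can be sketched as follows (all data and parameters are synthetic): each client runs k-means locally, and the server clusters the pooled local centroids in a single communication round.

```python
# Hedged sketch of one-shot federated k-means aggregation (not the FeCA algorithm).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# three hypothetical clients with heterogeneous local data
clients = [rng.normal(loc=c, scale=0.3, size=(300, 2)) for c in ([0, 0], [3, 3], [0, 3])]

k = 3
local_centroids = np.vstack([
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(data).cluster_centers_
    for data in clients
])                                       # each client sends only its k centroids
global_centroids = KMeans(n_clusters=k, n_init=10, random_state=0) \
    .fit(local_centroids).cluster_centers_
print(global_centroids)
```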

URL: https://openreview.net/forum?id=8YcUJbxmmC

---

Title: Fast, accurate and lightweight sequential simulation-based inference using Gaussian locally linear mappings

Authors: Henrik Häggström, Pedro L. C. Rodrigues, Geoffroy Oudoumanessah, Florence Forbes, Umberto Picchini

Abstract: Bayesian inference for complex models with an intractable likelihood can be tackled using algorithms performing many calls to computer simulators. These approaches are collectively known as "simulation-based inference" (SBI). Recent SBI methods have made use of neural networks (NN) to provide approximate, yet expressive constructs for the unavailable likelihood function and the posterior distribution. However, the trade-off between accuracy and computational demand leaves much space for improvement. In this work, we propose an alternative that provides both approximations to the likelihood and the posterior distribution, using structured mixtures of probability distributions. Our approach produces accurate posterior inference when compared to state-of-the-art NN-based SBI methods, even for multimodal posteriors, while exhibiting a much smaller computational footprint. We illustrate our results on several benchmark models from the SBI literature and on a biological model of the translation kinetics after mRNA transfection.

URL: https://openreview.net/forum?id=Q0nzpRcwWn

---

Title: SPriFed-OMP: A Differentially Private Federated Learning Algorithm for Sparse Basis Recovery

Authors: Ajinkya K Mulay, Xiaojun Lin

Abstract: Sparse basis recovery is a classical and important statistical learning problem when the number of model dimensions $p$ is much larger than the number of samples $n$. However, there has been little work that studies sparse basis recovery in the Federated Learning (FL) setting, where the client data's differential privacy (DP) must also be simultaneously protected. In particular, the performance guarantees of existing DP-FL algorithms (such as DP-SGD) will degrade significantly when $p \gg n$, and thus, they will fail to learn the true underlying sparse model accurately. In this work, we develop a new differentially private sparse basis recovery algorithm for the FL setting, called SPriFed-OMP. SPriFed-OMP converts OMP (Orthogonal Matching Pursuit) to the FL setting. Further, it combines SMPC (secure multi-party computation) and DP to ensure that only a small amount of noise needs to be added in order to achieve differential privacy. As a result, SPriFed-OMP can efficiently recover the true sparse basis for a linear model with only $n = \mathcal{O}(\sqrt{p})$ samples. We further present an enhanced version of our approach, SPriFed-OMP-GRAD, based on gradient privatization, that improves the performance of SPriFed-OMP. Our theoretical analysis and empirical results demonstrate that both SPriFed-OMP and SPriFed-OMP-GRAD terminate in a small number of steps, and they significantly outperform the previous state-of-the-art DP-FL solutions in terms of the accuracy-privacy trade-off.
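
For context, the centralized, non-private building block that SPriFed-OMP adapts is standard Orthogonal Matching Pursuit. Below is a hedged sketch of that building block on a synthetic sparse linear model; it is not the proposed federated or differentially private algorithm.

```python
# Illustrative only: sparse support recovery with plain OMP on synthetic data.
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
n, p, s = 200, 2000, 5                         # n samples, p dimensions, s-sparse truth
X = rng.normal(size=(n, p))
beta = np.zeros(p)
support = rng.choice(p, size=s, replace=False)
beta[support] = np.array([1.5, -2.0, 1.0, 3.0, -1.2])
y = X @ beta + 0.01 * rng.normal(size=n)

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=s).fit(X, y)
print("true support:     ", np.sort(support))
print("recovered support:", np.sort(np.flatnonzero(omp.coef_)))
```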

URL: https://openreview.net/forum?id=Dsavre6gjN

---

Title: Active Sequential Two-Sample Testing

Authors: Weizhi Li, Prad Kadambi, Pouria Saidi, Karthikeyan Natesan Ramamurthy, Gautam Dasarathy, Visar Berisha

Abstract: A two-sample hypothesis test is a statistical procedure used to determine whether the distributions generating two samples are identical. We consider the two-sample testing problem in a new scenario where the sample measurements (or sample features) are inexpensive to access, but their group memberships (or labels) are costly. To address the problem, we devise the first \emph{active sequential two-sample testing framework} that not only sequentially but also \emph{actively queries}. Our test statistic is a likelihood ratio where one likelihood is found by maximization over all class priors, and the other is provided by a probabilistic classification model. The classification model is adaptively updated and used to predict where the (unlabelled) features have a high dependency on labels; labeling the ``high-dependency'' features leads to the increased power of the proposed testing framework. In theory, we provide the proof that our framework produces an \emph{anytime-valid} $p$-value. In addition, we characterize the proposed framework's gain in testing power by analyzing the mutual information between the feature and label variables in asymptotic and finite-sample scenarios. In practice, we introduce an instantiation of our framework and evaluate it using several experiments; the experiments on the synthetic, MNIST, and application-specific datasets demonstrate that the testing power of the instantiated active sequential test significantly increases while the Type I error is under control.

URL: https://openreview.net/forum?id=EzPRgIq2Tk

---

Title: Simple and Scalable Strategies to Continually Pre-train Large Language Models

Authors: Adam Ibrahim, Benjamin Thérien, Kshitij Gupta, Mats Leon Richter, Quentin Gregory Anthony, Eugene Belilovsky, Timothée Lesort, Irina Rish

Abstract: Large language models (LLMs) are routinely pre-trained on billions of tokens, only to start the process over again once new data becomes available. A much more efficient solution is to continually pre-train these models—saving significant compute compared to re-training. However, the distribution shift induced by new data typically results in degraded performance on previous data or poor adaptation to the new data. In this work, we show that a simple and scalable combination of learning rate (LR) re-warming, LR re-decaying, and replay of previous data is sufficient to match the performance of fully re-training from scratch on all available data, as measured by the final loss and the average score on several language model (LM) evaluation benchmarks. Specifically, we show this for a weak but realistic distribution shift between two commonly used LLM pre-training datasets (English$\rightarrow$English) and a stronger distribution shift (English$\rightarrow$German) at the $405$M parameter model scale with large dataset sizes (hundreds of billions of tokens). Selecting the weak but realistic shift for larger-scale experiments, we also find that our continual learning strategies match the re-training baseline for a 10B parameter LLM. Our results demonstrate that autoregressive transformer-based LLMs can be successfully updated via simple and scalable continual learning strategies, matching the re-training baseline using only a fraction of the compute. Finally, inspired by previous work, we propose alternatives to the cosine learning rate schedule that help circumvent forgetting induced by LR re-warming and that are not bound to a fixed token budget.
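
The learning-rate recipe in the abstract is easy to sketch. The schedule below (illustrative hyperparameters, not the paper's exact values) linearly re-warms the LR at the start of each new dataset and then cosine re-decays it, rather than continuing from the tiny final LR of the previous run; replay of previous data would be mixed into each stage's batches alongside this schedule.

```python
# Hedged sketch of LR re-warming + re-decaying across continual pre-training stages.
import math

def rewarm_cosine_lr(step, warmup_steps, total_steps, lr_max, lr_min):
    """Linear re-warming to lr_max, then cosine re-decay down to lr_min."""
    if step < warmup_steps:
        return lr_max * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

# One fresh schedule per incoming dataset ("stage"); numbers are illustrative.
for stage, total_steps in enumerate([10_000, 10_000]):
    lrs = [rewarm_cosine_lr(s, 500, total_steps, 3e-4, 3e-5) for s in range(total_steps)]
    print(f"stage {stage}: lr starts at {lrs[0]:.1e}, peaks at {max(lrs):.1e}, ends at {lrs[-1]:.1e}")
```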

URL: https://openreview.net/forum?id=DimPeeCxKO

---

Title: Grid Cell-Inspired Fragmentation and Recall for Efficient Map Building

Authors: Jaedong Hwang, Zhang-Wei Hong, Eric R Chen, Akhilan Boopathy, Pulkit Agrawal, Ila R Fiete

Abstract: Animals and robots navigate through environments by building and refining maps of space. These maps enable functions including navigation back to home, planning, search and foraging. Here, we use observations from neuroscience, specifically the observed fragmentation of grid cell map in compartmentalized spaces, to propose and apply the concept of Fragmentation-and-Recall (FARMap) in the mapping of large spaces. Agents solve the mapping problem by building local maps via a surprisal-based clustering of space, which they use to set subgoals for spatial exploration. Agents build and use a local map to predict their observations; high surprisal leads to a "fragmentation event" that truncates the local map. At these events, the recent local map is placed into long-term memory (LTM) and a different local map is initialized. If observations at a fracture point match observations in one of the stored local maps, that map is recalled (and thus reused) from LTM. The fragmentation points induce a natural online clustering of the larger space, forming a set of intrinsic potential subgoals that are stored in LTM as a topological graph. Agents choose their next subgoal from the set of near and far potential subgoals from within the current local map or LTM, respectively. Thus, local maps guide exploration locally, while LTM promotes global exploration. We demonstrate that FARMap replicates the fragmentation points observed in animal studies. We evaluate FARMap on complex procedurally-generated spatial environments and realistic simulations to demonstrate that this mapping strategy much more rapidly covers the environment (number of agent steps and wall clock time) and is more efficient in active memory usage, without loss of performance.
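
As a very loose illustration of the bookkeeping described above (a hypothetical toy, not the FARMap agent or its surprisal model): the current local map is frozen into long-term memory when an observation is too surprising, and a stored map whose entries match the new observation is recalled and reused.

```python
# Toy fragmentation-and-recall bookkeeping; the surprisal function is a placeholder.
class FragmentingMapper:
    def __init__(self, surprisal_threshold=3.0):
        self.threshold = surprisal_threshold
        self.local_map = {}     # position -> observation for the current fragment
        self.ltm = []           # frozen local maps (long-term memory)

    def surprisal(self, pos, obs):
        # placeholder: 0 when the local map already predicts obs at pos,
        # large when contradicted, moderate when pos is unseen
        if pos not in self.local_map:
            return 1.0
        return 0.0 if self.local_map[pos] == obs else 10.0

    def step(self, pos, obs):
        if self.local_map and self.surprisal(pos, obs) > self.threshold:
            self.ltm.append(self.local_map)                       # fragmentation event
            recalled = next((m for m in self.ltm if m.get(pos) == obs), None)
            self.local_map = dict(recalled) if recalled else {}   # recall or start fresh
        self.local_map[pos] = obs

mapper = FragmentingMapper()
for pos, obs in [((0, 0), "wall"), ((0, 1), "open"), ((0, 1), "door"), ((0, 0), "wall")]:
    mapper.step(pos, obs)
print(len(mapper.ltm), mapper.local_map)
```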

URL: https://openreview.net/forum?id=cT8oOJ6Q6F

---

Title: Sparse Contextual CDF Regression

Authors: Kamyar Azizzadenesheli, William Lu, Anuran Makur, Qian Zhang

Abstract: Estimating cumulative distribution functions (CDFs) of context-dependent random variables is a central statistical task underpinning numerous applications in machine learning and economics. In this work, we extend a recent line of theoretical inquiry into this domain by analyzing the problem of \emph{sparse contextual CDF regression}, wherein data points are sampled from a convex combination of $s$ context-dependent CDFs chosen from a set of $d$ basis functions. We show that adaptations of several canonical regression methods serve as tractable estimators in this functional sparse regression setting under standard assumptions on the conditioning of the basis functions. In particular, given $n$ data samples, we prove estimation error upper bounds of $\tilde{O}(\sqrt{s/n})$ for functional versions of the lasso and Dantzig selector estimators, and $\tilde{O}(\sqrt{s}/\sqrt[4]{n})$ for a functional version of the elastic net estimator. Our results match the corresponding error bounds for finite-dimensional regression and improve upon CDF ridge regression which has $\tilde{O}(\sqrt{d/n})$ estimation error. Finally, we obtain a matching information-theoretic lower bound which establishes the minimax optimality of the lasso and Dantzig selector estimators up to logarithmic factors.

URL: https://openreview.net/forum?id=AIc48TjuSt

---


New submissions
===============


Title: Evaluating the Evaluators: Are Validation Methods for Few-Shot Learning Fit for Purpose?

Abstract: Numerous benchmarks for Few-Shot Learning have been proposed in the last decade. However, all of these benchmarks focus on performance averaged over many tasks, and the question of how to reliably evaluate and tune models trained for individual few-shot tasks has not been addressed. This paper presents the first investigation into task-level validation---a fundamental step when deploying a model. We measure the accuracy of performance estimators in the few-shot setting, consider strategies for model selection, and examine the reasons for the failure of evaluators usually thought of as being robust. We conclude that cross-validation with a low number of folds is the best choice for directly estimating the performance of a model, whereas using bootstrapping or cross-validation with a large number of folds is better for model selection purposes. Overall, we find that with current methods, benchmarks, and validation strategies, one cannot get a reliable picture of how effectively methods perform on individual tasks. However, we find that existing methods already provide enough information to enable selection of few-shot learners on a task-level basis.
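
A minimal sketch of the recommendation for direct performance estimation (low-fold cross-validation over the tiny support set); the embeddings and the stand-in classifier below are hypothetical.

```python
# Illustrative task-level validation of a few-shot learner via 2-fold cross-validation.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# hypothetical 5-way 5-shot support set of precomputed embeddings (one cluster per class)
class_means = rng.normal(size=(5, 128)) * 2.0
X = np.vstack([mean + rng.normal(size=(5, 128)) for mean in class_means])
y = np.repeat(np.arange(5), 5)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=2)  # few folds
print(f"estimated task accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```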

URL: https://openreview.net/forum?id=dKKY2mDEnD

---

Title: Augmenting cross-entropy with margin loss and applying moving average logits regularization to enhance adversarial robustness

Abstract: Despite significant progress in enhancing adversarial robustness, achieving a satisfactory level remains elusive, with a notable gap persisting between natural and adversarial accuracy. Recent studies have focused on mitigating inherent vulnerabilities in deep neural networks (DNNs) by augmenting existing methodologies with additional data or reweighting strategies. However, most reweighting strategies often perform poorly against stronger attacks, and generating additional data often entails increased computational demands. Our work proposes an enhancement strategy that complements the cross-entropy loss with a margin-based loss for generating adversarial samples used in training and in the training loss function of promising methodologies. We suggest regularizing the training process by minimizing the discrepancy between the Exponential Moving Average (EMA) of adversarial and natural logits. Additionally, we introduce a novel training objective called Logits Moving Average Adversarial Training (LMA-AT). Our experimental results demonstrate the efficacy of our proposed method, which achieves a more favorable balance between natural and adversarial accuracy, thereby reducing the disparity between the two.

URL: https://openreview.net/forum?id=ZRybMTg9aB

---

Title: WaveletGPT: Wavelets Meet Large Language Models

Abstract: Large Language Models (LLMs) have ushered in a new wave of artificial intelligence advancements impacting every scientific field and discipline. They are trained on a simple objective: to predict the next token given the previous context. We live in a world where most of the data around us, e.g., text, audio, and music, has a multi-scale structure associated with it. This paper infuses LLMs with traditional signal processing ideas, namely wavelets, during pre-training to take advantage of the structure. Without adding \textbf{any extra parameters} to a GPT-style LLM architecture, we achieve the same pre-training performance almost twice as fast for LLMs in text, raw audio, and symbolic music by imposing a structure on intermediate embeddings. When trained for the same number of training steps, we achieve significant gains in performance, comparable to pre-training a much larger neural architecture. Our architecture allows every next-token prediction to have access to intermediate embeddings at different temporal resolutions in every Transformer decoder layer. This work will hopefully pave the way for incorporating multi-rate signal processing ideas into traditional large language model pre-training. Further, we showcase improving model performance through better internal structure as opposed to simply scaling up.

URL: https://openreview.net/forum?id=4B9Nfg8pwa

---

Title: CW-CNN & CW-AN: Convolutional Networks and Attention Networks for CW-Complexes

Abstract: We present a novel framework for learning on CW-complex structured data points. Recent advances have discussed CW-complexes as ideal learning representations for problems in cheminformatics. However, there is a lack of available machine learning methods suitable for learning on CW-complexes. In this paper we develop notions of convolution and attention that are well defined for CW-complexes. These notions enable us to create the first neural network that can receive a CW-complex as input. We illustrate and interpret this framework in the context of supervised prediction.

URL: https://openreview.net/forum?id=1fLPE8k4ij

---

Title: Hashing with Uncertainty Quantification via Sampling-based Hypothesis Testing

Abstract: To quantify different types of uncertainty when deriving hash-codes for image retrieval, we develop a probabilistic hashing model (ProbHash). Sampling-based hypothesis testing is then derived for hashing with uncertainty quantification (HashUQ) in ProbHash to improve the granularity of hashing-based retrieval by prioritizing the data with confident hash-codes. HashUQ can drastically improve the retrieval performance without sacrificing computational efficiency. For efficient deployment of HashUQ in real-world applications, we discretize the quantified uncertainty to reduce the potential storage overhead. Experimental results show that our HashUQ can achieve state-of-the-art retrieval performance on three image datasets. Ablation experiments on model hyperparameters, different model components, and effects of UQ are also provided with performance comparisons.

URL: https://openreview.net/forum?id=cc4v6v310f

---

Title: Denoised Predictions: Combining Information Bottleneck and Predictive Information to learn denoised representations

Abstract: Humans excel at isolating relevant information from noisy data to predict the behavior of dynamic systems, effectively disregarding non-informative, temporally-correlated noise. In contrast, existing reinforcement learning algorithms face challenges in generating noise-free predictions within high-dimensional, noise-saturated environments, especially when trained on world models featuring realistic background noise extracted from natural video streams. We propose a novel information-theoretic approach that learns world models based on minimizing past information and retaining maximal information about the future, aiming to simultaneously learn control policies and produce denoised predictions. Utilizing Soft Actor-Critic agents augmented with an information-theoretic auxiliary loss, we validate our method’s effectiveness on complex variants of the standard DeepMind Control Suite tasks, where natural videos filled with intricate and task-irrelevant information serve as a background. Experimental results demonstrate that our model outperforms nine state-of-the-art approaches in various settings where natural videos serve as dynamic background noise. Our analysis also reveals that all these methods encounter challenges in more complex environments.

URL: https://openreview.net/forum?id=YmE49eLuz2

---

Title: Mislabeled examples detection viewed as probing machine learning models: concepts, survey and extensive benchmark

Abstract: Mislabeled examples are ubiquitous in real-world machine learning datasets. We show that most mislabeled-example detection methods can be viewed as probing trained machine learning models using a few core principles. We formalize a modular framework that encompasses these methods, parameterized by only four building blocks, and provide a Python library showing that these principles can actually be implemented. The focus is on classifier-agnostic concepts, with an emphasis on adapting methods developed for deep learning models to non-deep classifiers for tabular data. We benchmark existing methods on Completely At Random (NCAR) and Not At Random (NNAR) labeling noise coming from a series of datasets with automatic labeling rules. This benchmark offers new insights into, and reveals the limitations of, existing methods in this setup.

URL: https://openreview.net/forum?id=3YlOr7BHkx

---

Title: An Attribute-based Method for Video Anomaly Detection

Abstract: Video anomaly detection (VAD) identifies suspicious events in videos, which is critical for crime prevention and homeland security. In this paper, we propose a simple but highly effective VAD method that relies on attribute-based representations. The base version of our method represents every object by its velocity and pose, and computes anomaly scores by density estimation. Surprisingly, this simple representation is sufficient to achieve state-of-the-art performance in ShanghaiTech, the most commonly used VAD dataset. Combining our attribute-based representations with an off-the-shelf, pretrained deep representation yields state-of-the-art performance with a $99.1\%, 93.7\%$, and $85.9\%$ AUROC on Ped2, Avenue, and ShanghaiTech, respectively.
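
A hedged sketch of the attribute-plus-density idea (not the paper's exact pipeline or features): per-object attribute vectors such as velocity and pose descriptors are fitted with a density model on normal training video, and test objects are scored by negative log-likelihood. The GMM here is one possible density estimator; the paper's choice may differ.

```python
# Illustrative attribute-based anomaly scoring via density estimation.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# hypothetical per-object attributes from normal training video: [speed, pose feature]
train_attrs = rng.normal(loc=[1.0, 0.0], scale=[0.2, 0.3], size=(2000, 2))
test_attrs = np.array([[1.0, 0.05],    # object moving at a typical speed
                       [6.0, 0.00]])   # unusually fast object

density = GaussianMixture(n_components=8, covariance_type="full", random_state=0)
density.fit(train_attrs)
anomaly_scores = -density.score_samples(test_attrs)   # higher = more anomalous
print(anomaly_scores)
```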

URL: https://openreview.net/forum?id=XL1N6iLr0G

---

Title: PolygoNet: Leveraging Polygonal Contours for Efficient Image Classification with deep neural networks

Abstract: In recent years, deep learning models have demonstrated remarkable capabilities in various image-related tasks, yet they are often plagued by computational complexity and susceptibility to overfitting. In this paper, we propose a novel approach that leverages efficient polygon representations, obtained through dominant points, of the input images to address these challenges for image classification tasks. Our method focuses on transforming input images into polygon representations, which are subsequently utilized for training deep neural networks. The key contribution lies in the use of these dominant points, which offer a concise and flexible representation of images. By transforming images into dominant points, we significantly reduce the computational burden associated with processing large image datasets. This reduction in computation not only accelerates the training process but also conserves computational resources, making our approach particularly appealing for real-time applications and resource-constrained environments. We validate our approach through extensive experiments on benchmark datasets, showcasing its effectiveness in reducing computation. The experimental results demonstrate that our method achieves state-of-the-art performance across various image classification tasks, underscoring its potential in both standard and edge-computing configurations.

URL: https://openreview.net/forum?id=QY7y1fPIj1

---

Title: FaAlGrad: Fairness through Alignment of Gradients across Different Subpopulations

Abstract: The growing deployment of Machine Learning systems has increased interest in systems optimized for other important criteria along with the expected task performance. For instance, machine learning models often exhibit biases that lead to unfair outcomes for certain protected subpopulations. This work aims to handle the bias in machine learning models and enhance their fairness by aligning the loss gradients. Specifically, leveraging the meta-learning technique, we propose a novel training framework that aligns the gradients computed across different subpopulations for learning fair classifiers. Aligning the gradients enables our framework to regularize the training process, thereby prioritizing fairness over predictive accuracy. Our experiments on multiple benchmark datasets demonstrate significant improvements in fairness metrics without having any exclusive regularizers for fairness. Thus our work contributes to developing fairer machine learning models with broader societal benefits.

URL: https://openreview.net/forum?id=k4AxEwTaHq

---

Title: Self-supervised Color Generalization in Reinforcement Learning

Abstract: A challenge in reinforcement learning lies in effectively deploying trained policies to handle out-of-distribution data and environmental variations. Agents observing pixel-based image data are generally sensitive to background distractions and color changes. Commonly, color generalization is achieved through data augmentation. In contrast, we propose a color-invariant neural network layer that adopts distinct color symmetries in a self-supervised fashion. This allows for color sensitivity while achieving generalization. Our approach is based on dynamic-mode decomposition, which also accommodates spatial and temporal symmetries; we discuss the controlled breaking of the latter. We empirically evaluate our method in the Minigrid, Procgen, and DeepMind Control suites and find improved color sensitivity and generalisation.

URL: https://openreview.net/forum?id=4On0PLRI8H

---

Title: The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources

Abstract: Foundation model development attracts a rapidly expanding body of contributors, scientists, and applications. To help shape responsible development practices, we introduce the Foundation Model Development Cheatsheet: a growing collection of 250+ tools and resources spanning text, vision, and speech modalities. We draw on a large body of prior work to survey resources (e.g. software, documentation, frameworks, guides, and practical tools) that support informed data selection, processing, and understanding, precise and limitation-aware artifact documentation, efficient model training, advance awareness of the environmental impact from training, careful model evaluation of capabilities, risks, and claims, as well as responsible model release, licensing and deployment practices.
We hope this curated collection of resources helps guide more responsible development. The process of curating this list enabled us to review the AI development ecosystem, revealing which tools are critically missing, misused, or over-used in existing practices. We find that (i) tools for data sourcing, model evaluation, and monitoring are critically under-serving ethical and real-world needs, (ii) evaluations for model safety, capabilities, and environmental impact all lack reproducibility and transparency, (iii) text and particularly English-centric analyses continue to dominate over multilingual and multi-modal analyses, and (iv) evaluation of systems, rather than just models, is needed so that capabilities and impact are assessed in context.

URL: https://openreview.net/forum?id=tH1dQH20eZ

---

Title: Masked Capsule Autoencoders

Abstract: We propose Masked Capsule Autoencoders (MCAE), the first Capsule Network that utilises pretraining in a self-supervised manner. Capsule Networks have emerged as a powerful alternative to Convolutional Neural Networks (CNNs), and have shown favourable properties when compared to Vision Transformers (ViT), but have struggled to effectively learn when presented with more complex data, leading to Capsule Network models that do not scale to modern tasks. Our proposed MCAE model alleviates this issue by reformulating the Capsule Network to use masked image modelling as a pretraining stage before finetuning in a supervised manner. Across several experiments and ablation studies we demonstrate that, similarly to CNNs and ViTs, Capsule Networks can also benefit from self-supervised pretraining, paving the way for further advancements in this neural network domain. For instance, by pretraining on the Imagenette dataset, a dataset of 10 classes of Imagenet-sized images, we achieve not only state-of-the-art results for Capsule Networks but also a 9\% improvement compared to purely supervised training. Thus we propose that Capsule Networks benefit from and should be trained within a masked image modelling framework, with a novel capsule decoder, to improve a Capsule Network's performance on realistic-sized images.

URL: https://openreview.net/forum?id=JHxrh00W1j

---

Title: Mixture of Latent Experts Using Tensor Products

Abstract: In multi-task learning, the conventional approach involves training a model on multiple tasks simultaneously. However, the training signals from different tasks can interfere with one another, potentially leading to \textit{negative transfer}.
To mitigate this, we propose a novel \textit{latent-expert} approach (\texttt{TensorPoly}) that balances parameter efficiency with nuanced routing methods. For the \textit{experts}, we reparameterize Low-Rank Adaptation (\texttt{LoRA}) by employing an entangled tensor through the use of tensor product operations and name the resulting approach \texttt{TLoRA}. For the \textit{routing function}, we tailor two innovative routing functions according to granularity: \texttt{TensorPoly-I}, which directs to each rank within the entangled tensor, while \texttt{TensorPoly-II} offers a finer-grained routing approach targeting each order of the entangled tensor. The experimental results from the multi-task T0-benchmark demonstrate that: 1) all latent-expert approaches surpass the corresponding dense approaches, highlighting the potential of modular language models to mitigate negative interference in multi-task learning and deliver superior outcomes; 2) \texttt{TensorPoly-I} achieves higher parameter efficiency in adaptation and outperforms other modular LMs, which shows the potential of our approach in multi-task transfer learning\footnote{we will release the code when the paper is accepted}.
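
For readers unfamiliar with the expert parameterization being reparameterized, below is a minimal sketch of the vanilla LoRA building block (a frozen weight plus a learned rank-r update). The entangled-tensor variant (TLoRA) and the routing functions from the paper are not reproduced here.

```python
# Illustrative vanilla LoRA layer; not the paper's TLoRA or its routing.
import torch

class LoRALinear(torch.nn.Module):
    """Frozen pretrained weight W plus a learned low-rank update scaling * B @ A."""
    def __init__(self, in_features, out_features, rank=8, alpha=16.0):
        super().__init__()
        self.weight = torch.nn.Parameter(0.02 * torch.randn(out_features, in_features),
                                         requires_grad=False)        # frozen "pretrained" W
        self.A = torch.nn.Parameter(0.01 * torch.randn(rank, in_features))
        self.B = torch.nn.Parameter(torch.zeros(out_features, rank))  # zero-init: no-op at start
        self.scaling = alpha / rank

    def forward(self, x):
        return x @ (self.weight + self.scaling * self.B @ self.A).T

layer = LoRALinear(in_features=64, out_features=32)
print(layer(torch.randn(4, 64)).shape)   # torch.Size([4, 32])
```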

URL: https://openreview.net/forum?id=SgxeJW4DGk

---

Title: When Stability meets Sufficiency: Informative Explanations that do not Overwhelm

Abstract: Recent studies evaluating various criteria for explainable artificial intelligence (XAI) suggest that fidelity, stability, and comprehensibility are among the most important metrics considered by users of AI across a diverse collection of usage contexts. We consider these criteria as applied to feature-based attribution methods, which are amongst the most prevalent in XAI literature. Going beyond standard correlation, methods have been proposed that highlight what should be minimally sufficient to justify the classification of an input (viz. pertinent positives). While minimal sufficiency is an attractive property akin to comprehensibility, the resulting explanations are often too sparse for a human to understand and evaluate the local behavior of the model. To overcome these limitations, we incorporate the criteria of stability and fidelity and propose a novel method called Path-Sufficient Explanations Method (PSEM) that outputs a sequence of stable and sufficient explanations for a given input of strictly decreasing size (or value) -- from original input to a minimally sufficient explanation -- which can be thought to trace the local boundary of the model in a stable manner, thus providing better intuition about the local model behavior for the specific input. We validate these claims, both qualitatively and quantitatively, with experiments that show the benefit of PSEM across three modalities (image, tabular and text) as well as versus other path explanations. A user study depicts the strength of the method in communicating the local behavior, where (many) users are able to correctly determine the prediction made by a model.

URL: https://openreview.net/forum?id=8JNXOB6FtW

---

Title: Assessing and enhancing robustness of active learning strategies to spurious bias

Abstract: In the presence of spurious correlation, a deep neural network (DNN) trained using empirical risk minimization (ERM) tends to rely on spurious features during predictions, particularly when the target label exhibits spurious correlations with certain attributes in the training set. Prior works have proposed methods to mitigate bias caused by spurious correlations in passive learning scenarios. In this work, we focus on investigating the performance of common active learning (AL) algorithms under spurious bias and designing an AL algorithm that is robust to spurious bias. AL is a framework that iteratively acquires new samples to progressively improve the classifier. In AL loops, sample acquisition is directed by informativeness criteria, such as uncertainty and representativeness. The concept behind these criteria shares similarities with approaches to addressing spurious correlations in passive settings (i.e., underrepresented samples are deemed informative and thus given higher value during training). In fact, Tamkin et al. (2022) have demonstrated the potential of AL in addressing out-of-distribution problems. Hence, with an appropriately defined acquisition function, a sample-efficient framework can be established to effectively handle spurious correlations. Inspired by recent works on simplicity bias, we propose Domain-Invariant Active Learning (DIAL), which leverages the disparity in training dynamics between overrepresented and underrepresented samples, selecting samples that exhibit “slow” training dynamics. DIAL involves no excessively resource-intensive computations beyond the standard training process and feedforward inference, making it more scalable for addressing real-world problems with AL. Empirical results demonstrate that DIAL outperforms baselines not only in robustness under spurious-correlation scenarios but also on standard ML datasets.
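
One hedged reading of a "slow training dynamics" acquisition criterion (our illustrative interpretation, not necessarily the authors' exact implementation) is to track how quickly the model's confidence on each unlabeled candidate improves across checkpoints and prefer the slowest:

    import numpy as np

    def select_slow_dynamics(confidence_history, budget):
        # confidence_history: (num_checkpoints, num_candidates) array of the
        # model's max-softmax confidence on each unlabeled candidate, recorded
        # at successive training checkpoints via feedforward inference.
        improvement = confidence_history[-1] - confidence_history[0]
        # Candidates whose confidence grew the least are treated as "slow"
        # (likely underrepresented) and acquired for labeling.
        return np.argsort(improvement)[:budget]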

URL: https://openreview.net/forum?id=2XVECaYiFB

---

Title: Toward a Complete Criterion for Value of Information in Insoluble Decision Problems

Abstract: In a decision problem, observations are said to be material if they must be taken into account to perform optimally. Decision problems have an underlying (graphical) causal structure, which can sometimes be used to establish that certain observations are immaterial. For soluble graphs — ones where important past observations are remembered — there is a complete graphical criterion: one that rules out materiality whenever this can be done on the basis of the graphical structure alone. In this work, we analyse a proposed criterion for insoluble graphs. In particular, we prove that some of the conditions used to prove immateriality are necessary; when they are not satisfied, materiality is possible. We discuss possible avenues and obstacles for proving the necessity of the remaining conditions.

URL: https://openreview.net/forum?id=0RUzRV05Jn

---

Title: The Unreasonable Effectiveness of Gaussian Score Approximation for Diffusion Models and its Applications

Abstract: Score-based models have achieved remarkable results in the generative modeling of multiple domains. By learning the gradient of smoothed data distributions, they can iteratively generate samples from complex distributions, e.g., natural images.
The learned score function enables their generalization capability, but its structure and relation to the underlying data manifold remain largely unclear.
Here, we aim to identify such structures through a normative analysis of diffusion models with the exact score of tractable distributions, e.g. Gaussian and Gaussian mixture.
We find that a diffusion model with a Gaussian score admits a closed-form solution, which predicts many qualitative aspects of the sample generation dynamics.
Further, we claim that, for high noise scales, the learned neural score is dominated by the linear score of the Gaussian data approximation; for lower noise scales, the learned neural score is more similar to the score of a coarse-grained approximation of data, e.g. Gaussian mixture.
We supply theoretical arguments for this claim and empirically show that the Gaussian approximation is accurate for a surprisingly wide range of noise in practical diffusion models.
We further study the score learning dynamics and find that diffusion models learn the simpler Gaussian score preferentially.
Our findings enable us to precisely predict the initial diffusion trajectory using the Gaussian analytical solution, and we can accelerate image sampling by 15-30\% by skipping the initial phase while maintaining image quality (with a near state-of-the-art FID score of 1.93 on CIFAR-10 unconditional generation). Our findings strengthen the field's theoretical understanding of how diffusion models work and suggest ways to improve their design and training.
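
For reference, the closed form alluded to above follows from a standard Gaussian computation (ours, not quoted from the paper): if the data are $x \sim \mathcal{N}(\mu, \Sigma)$ and isotropic noise of scale $\sigma$ is added, the smoothed density is $\mathcal{N}(\mu, \Sigma + \sigma^2 I)$, so the exact score is

    $$\nabla_x \log p_\sigma(x) = -(\Sigma + \sigma^2 I)^{-1}(x - \mu),$$

which is linear in $x$; at large $\sigma$ this linear term dominates whatever finer structure the learned score has captured.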

URL: https://openreview.net/forum?id=I0uknSHM2j

---

Title: Prompt-Based Exemplar Super-Compression and Regeneration for Class-Incremental Learning

Abstract: Replay-based methods in class-incremental learning (CIL) have attained remarkable success. Despite their effectiveness, the inherent memory restriction results in saving a limited number of exemplars with poor diversity. In this paper, we introduce ESCORT, a novel approach that substantially increases the quantity and enhances the diversity of exemplars based on a pre-trained general-purpose diffusion model, without fine-tuning it on target datasets or storing it in the memory buffer. Images are compressed into visual and textual prompts, which are saved instead of the original images, decreasing memory consumption by a factor of 24. In subsequent phases, diverse exemplars are regenerated by the diffusion model. We further propose partial compression and diffusion-based data augmentation to minimize the domain gap between generated exemplars and real images. Comprehensive experiments demonstrate that ESCORT significantly improves CIL performance across multiple benchmarks, e.g., 3.2% above the previous state-of-the-art on ImageNet-100.

URL: https://openreview.net/forum?id=7mKstT00bJ

---

Title: Variational Pseudo Marginal Methods for Jet Reconstruction in Particle Physics

Abstract: Reconstructing jets, which provide vital insights into the properties and histories of subatomic particles produced in high-energy collisions, is a central problem in data analysis in collider physics. This intricate task deals with estimating the latent structure of a jet (a binary tree) and involves parameters such as particle energy, momentum, and type. While Bayesian methods offer a natural approach for handling uncertainty and leveraging prior knowledge, they face significant challenges due to the super-exponential growth of potential jet topologies as the number of observed particles increases. To address this, we introduce a Combinatorial Sequential Monte Carlo approach for inferring jet latent structures. As a second contribution, we leverage the resulting estimator to develop a variational inference algorithm for parameter learning. Building on this, we introduce a variational family using a pseudo-marginal framework for a fully Bayesian treatment of all variables, unifying the generative model with the inference process. We illustrate our method's effectiveness through experiments using data generated with a collider physics generative model, highlighting superior speed and accuracy across a range of tasks.

URL: https://openreview.net/forum?id=pCapRF2vFf

---

Title: Trusted Aggregation (TAG): Backdoor Defense in Federated Learning

Abstract: Federated learning is a framework for training machine learning models from multiple clients, each holding a local data set, without access to the data in aggregate. Instead, a shared model is jointly learned through an interactive process in which a centralized server combines locally learned model gradients or weights from the clients. However, the lack of data transparency naturally raises concerns about model security. Recently, several state-of-the-art backdoor attacks have been proposed, which achieve high attack success rates while simultaneously being difficult to detect, leading to compromised federated learning models. In this paper, motivated by differences in the logits of models trained with and without the presence of backdoor attacks, we propose Trusted Aggregation (TAG), a defense method that can prevent backdoor attacks from influencing the model while maintaining the accuracy of the original classification task. TAG leverages a small validation data set to estimate the largest change a benign client's local training can make to the shared model, which is then used to filter client updates to the shared model. Experimental results on multiple data sets show that TAG defends against backdoor attacks even when 40 percent of user submissions to update the shared model are malicious.
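
A hedged sketch of the filtering idea as we read it (the norm-based threshold rule and all names below are our simplifying assumptions, not the paper's exact procedure): the server estimates, from its small validation set, how far a benign local update can plausibly move the shared model and drops client updates that move it farther.

    import torch

    def trusted_aggregate(global_params, client_updates, benign_reference, slack=1.5):
        """global_params: dict of tensors; client_updates: list of dicts of
        parameter deltas; benign_reference: delta obtained by the server
        training on its own small validation set. Illustrative only."""
        ref_norm = torch.sqrt(sum((d ** 2).sum() for d in benign_reference.values()))
        accepted = []
        for update in client_updates:
            norm = torch.sqrt(sum((d ** 2).sum() for d in update.values()))
            if norm <= slack * ref_norm:       # reject suspiciously large updates
                accepted.append(update)
        # Average the accepted deltas and apply them to the shared model.
        new_params = {}
        for name, param in global_params.items():
            if accepted:
                mean_delta = sum(u[name] for u in accepted) / len(accepted)
            else:
                mean_delta = torch.zeros_like(param)
            new_params[name] = param + mean_delta
        return new_params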

URL: https://openreview.net/forum?id=r9eNUDe2im

---

Title: Differentially Private Latent Diffusion Models

Abstract: Diffusion models (DMs) are one of the most widely used generative models for producing high-quality images. However, a flurry of recent papers points out that DMs are among the least private forms of image generators, as a significant number of near-identical replicas of training images can be extracted from them. Existing privacy-enhancing techniques for DMs, unfortunately, do not provide a good privacy-utility tradeoff. In this paper, we aim to improve the current state of DMs with differential privacy (DP) by adopting Latent Diffusion Models (LDMs). LDMs are equipped with powerful pre-trained autoencoders that map the high-dimensional pixels into lower-dimensional latent representations, in which DMs are trained, yielding more efficient and faster training of DMs. Rather than fine-tuning the entire LDMs, we fine-tune only the attention modules of LDMs with DP-SGD, reducing the number of trainable parameters by roughly 90% and achieving a better privacy-accuracy trade-off. Our approach allows us to generate realistic, high-dimensional images (256x256) conditioned on text prompts with DP guarantees, which, to the best of our knowledge, has not been attempted before. Our approach provides a promising direction for training more powerful, yet training-efficient differentially private DMs, producing high-quality DP images. Our code is available at https://anonymous.4open.science/r/DP-LDM-4525.
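
The "attention-only" fine-tuning step is easy to picture; a minimal sketch (the module-name matching is an assumption about how the LDM's layers are named, and the DP-SGD wrapper itself is omitted):

    def freeze_all_but_attention(model, attn_keywords=("attn", "attention")):
        """Mark only attention-module parameters as trainable, so that the
        DP-SGD optimizer (e.g. via a library such as Opacus) touches roughly
        10% of the parameters. Keyword matching is illustrative."""
        trainable = 0
        for name, param in model.named_parameters():
            param.requires_grad = any(k in name.lower() for k in attn_keywords)
            trainable += param.numel() if param.requires_grad else 0
        return trainable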

URL: https://openreview.net/forum?id=AkdQ266kHj

---

Title: Effect of Random Learning Rate: Theoretical Analysis of SGD Dynamics in Non-Convex Optimization via Stationary Distribution

Abstract: We consider a variant of stochastic gradient descent (SGD) with a random learning rate and reveal its convergence properties. SGD is a widely used stochastic optimization algorithm in machine learning, especially deep learning. Numerous studies reveal the convergence properties of SGD and its simplified variants. Among these, the analysis of convergence using a stationary distribution of updated parameters provides generalizable results. However, to obtain a stationary distribution, the update direction of the parameters must not degenerate, which limits the applicable variants of SGD. In this study, we consider a novel SGD variant, Poisson SGD, which has degenerate parameter update directions and instead utilizes a random learning rate. Consequently, we demonstrate that the distribution of parameters updated by Poisson SGD converges to a stationary distribution under weak assumptions on the loss function. Based on this, we further show that Poisson SGD finds global minima in non-convex optimization problems and also evaluate the generalization error of this method. As a proof technique, we approximate the distribution produced by Poisson SGD with that of the bouncy particle sampler (BPS) and derive its stationary distribution, using theoretical advances in piecewise-deterministic Markov processes (PDMPs).
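
As a purely illustrative reading of "SGD with a random learning rate" (the specific distribution below is our placeholder, not necessarily the one analyzed in the paper):

    import numpy as np

    def random_lr_sgd_step(params, grad, base_lr=0.1, rng=None):
        # One SGD step whose learning rate is redrawn at random each update;
        # the exponential distribution here is only a stand-in choice.
        rng = np.random.default_rng() if rng is None else rng
        lr = base_lr * rng.exponential(1.0)
        return params - lr * grad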

URL: https://openreview.net/forum?id=iRvwtiAaDy

---

Title: Modeling Causal Mechanisms with Diffusion Models for Interventional and Counterfactual Queries

Abstract: We consider the problem of answering observational, interventional, and counterfactual queries in a causally sufficient setting where only observational data and the causal graph are available. Utilizing recent developments in diffusion models, we introduce diffusion-based causal models (DCM) to learn causal mechanisms that generate unique latent encodings. These encodings enable us to directly sample under interventions and perform abduction for counterfactuals. Diffusion models are a natural fit here, since they can encode each node into a latent representation that acts as a proxy for exogenous noise. Our empirical evaluations demonstrate significant improvements over existing state-of-the-art methods for answering causal queries. Furthermore, we provide theoretical results that offer a methodology for analyzing counterfactual estimation in general encoder-decoder models, which could be useful in settings beyond our proposed approach.

URL: https://openreview.net/forum?id=EDHQDsqiSe

---

Title: Meta-Gradient Search Control: A Method for Improving the Efficiency of Dyna-style Planning

Abstract: We study how a Reinforcement Learning (RL) system can remain sample-efficient when learning from an imperfect model of the environment. This is particularly challenging when the learning system is resource-constrained and in continual settings, where the environment dynamics change. To address these challenges, our paper introduces an online, meta-gradient algorithm that tunes a probability with which states are queried during Dyna-style planning. Our study compares the aggregate, empirical performance of this meta-gradient method to baselines that employ conventional sampling strategies. Results indicate that our method improves efficiency of the planning process, which, as a consequence, improves the sample-efficiency of the overall learning process. On the whole, we observe that our meta-learned solutions avoid several pathologies of conventional planning approaches, such as sampling inaccurate transitions and those that stall credit assignment. We believe these findings could prove useful, in future work, for designing model-based RL systems at scale.

URL: https://openreview.net/forum?id=awiDwOS2k7

---

Title: Personalized Federated Learning of Probabilistic Models: A PAC-Bayesian Approach

Abstract: Federated Learning (FL) aims to infer a shared model from private and decentralized data stored by multiple clients. Personalized FL (PFL) enhances the model’s fit for each client by adapting the global model to the clients. A significant level of personalization is required for highly heterogeneous clients but can be challenging to achieve, especially when clients’ datasets are small. We introduce PAC-PFL for PFL of probabilistic models. PAC-PFL infers a shared hyper-posterior and treats each client’s posterior inference as the personalization step. Unlike previous PFL algorithms, PAC-PFL does not regularize all personalized models towards a single shared model, thereby greatly enhancing its personalization flexibility. By establishing and minimizing a PAC-Bayesian generalization bound on the average true loss of clients, PAC-PFL effectively mitigates overfitting even in data-poor scenarios. Additionally, PAC-PFL provides generalization bounds for new clients joining later. PAC-PFL achieves accurate and well-calibrated predictions, as supported by our experiments.
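
For orientation, the classical single-task PAC-Bayesian bound in this general family (McAllester-style, for a loss bounded in $[0,1]$; the paper's federated, hyper-posterior version is more involved) states that, with probability at least $1-\delta$ over an i.i.d. sample of size $n$, for every posterior $Q$ and any fixed prior $P$,

    $$\mathbb{E}_{h \sim Q}[L(h)] \le \mathbb{E}_{h \sim Q}[\hat{L}(h)] + \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln(2\sqrt{n}/\delta)}{2n}},$$

and minimizing the right-hand side is what drives this style of personalization objective.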

URL: https://openreview.net/forum?id=ZMliWjMCor

---

Title: Plug, Play, and Generalize: Length Extrapolation with Pointer-Augmented Neural Memory

Abstract: We introduce Pointer-Augmented Neural Memory (PANM), a versatile module designed to enhance neural networks' ability to process symbols and extend their capabilities to longer data sequences. PANM integrates an external neural memory utilizing novel physical addresses and pointer manipulation techniques, emulating human and computer-like symbol processing abilities. PANM facilitates operations like pointer assignment, dereferencing, and arithmetic by explicitly employing physical pointers for memory access. This module can be trained end-to-end on sequence data, empowering various sequential models, from simple recurrent networks to large language models (LLMs). Our experiments showcase PANM's exceptional length extrapolation capabilities and its enhancement of recurrent neural networks in symbol processing tasks, including algorithmic reasoning and Dyck language recognition. PANM enables Transformers to achieve up to 100% generalization accuracy in compositional learning tasks and significantly improves performance in mathematical reasoning, question answering, and machine translation. Notably, the generalization effectiveness scales with stronger backbone models, as evidenced by substantial performance gains when we test LLMs finetuned with PANM for tasks up to 10-100 times longer than the training data.

URL: https://openreview.net/forum?id=dyQ9vFbF6D

---

Title: Measuring Orthogonality in Representations of Generative Models

Abstract: In unsupervised representation learning, models aim to distill essential features from high-dimensional data into lower-dimensional learned representations, guided by inductive biases. Understanding the characteristics that make a good representation remains a topic of ongoing research. Disentanglement of independent generative processes has long been credited with producing high-quality representations.
However, focusing solely on representations that adhere to the stringent requirements of most disentanglement metrics may result in overlooking many high-quality representations that are well suited for various downstream tasks. These metrics often demand that generative factors be encoded in distinct, single dimensions aligned with the canonical basis of the representation space.

Motivated by these observations, we propose two novel metrics: Importance-Weighted Orthogonality (IWO) and Importance-Weighted Rank (IWR). These metrics evaluate the mutual orthogonality and rank of generative factor subspaces. Through extensive experiments on common downstream tasks, over several benchmark datasets and models, IWO and IWR consistently show stronger correlations with downstream task performance than traditional disentanglement metrics. Our findings suggest that representation quality is more closely related to the orthogonality of independent generative processes than to their disentanglement, offering a new direction for evaluating and improving unsupervised learning models.

URL: https://openreview.net/forum?id=TSUprKRga1

---

Title: CAPM: Fast and Robust Verification on Maxpool-based CNN via Dual Network

Abstract: This study uses CAPM (Convex Adversarial Polytope for Maxpool-based CNN) to improve the verified bound for general-purpose maxpool-based convolutional neural networks (CNNs) under bounded-norm adversarial perturbations. The maxpool function is decomposed as a series of ReLU functions to extend the convex relaxation technique to maxpool functions, by which the verified bound can be efficiently computed through a dual network. The experimental results demonstrate that this technique achieves state-of-the-art verification precision for maxpool-based CNNs and involves a much lower computational cost than current verification methods, such as DeepZ, DeepPoly and PRIMA. This method is also applicable to large-scale CNNs, which previous studies have often found to be computationally prohibitive. Under certain circumstances, CAPM is 40 times, 20 times, or twice as fast as PRIMA/DeepPoly/DeepZ and gives a significantly higher verification bound (CAPM 98% vs. PRIMA 76% / DeepPoly 73% / DeepZ 8%). Furthermore, we additionally present the time complexity of our algorithm as $O(W^2NK)$, where $W$ is the maximum width of the neural network, $N$ is the number of neurons, and $K$ is the size of the maxpool layer's kernel.
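
The maxpool-to-ReLU decomposition mentioned above rests on the identity $\max(a, b) = a + \mathrm{ReLU}(b - a)$, applied pairwise across the kernel; a tiny sketch:

    import torch.nn.functional as F

    def max_via_relu(values):
        """Compute the elementwise maximum over a list of tensors using only
        additions and ReLUs, mirroring the decomposition used to extend
        convex relaxations from ReLU layers to maxpool layers."""
        running_max = values[0]
        for v in values[1:]:
            running_max = running_max + F.relu(v - running_max)
        return running_max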

URL: https://openreview.net/forum?id=fvItSLnGHP

---

Title: Evaluating the Robustness of Text-to-image Diffusion Models against Real-world Attacks

Abstract: Text-to-image (T2I) diffusion models (DMs) have shown promise in generating high-quality images from textual descriptions. The real-world applications of these models require particular attention to their safety and fidelity, which has not yet been sufficiently explored. One fundamental question is whether existing T2I DMs are robust to variations in the input text. To answer this question, this work provides the first robustness evaluation of T2I DMs against real-world perturbations. Unlike malicious attacks that involve apocryphal alterations to the input texts, we consider a perturbation space spanned by realistic errors (e.g., typo, glyph, phonetic) that humans can make and develop adversarial attacks to generate worst-case perturbations for robustness evaluation. Given the inherent randomness of the generation process, we design four novel distribution-based objectives to mislead T2I DMs. We optimize the objectives in a black-box manner without any knowledge of the model. Extensive experiments demonstrate the effectiveness of our method for attacking popular T2I DMs and simultaneously reveal their non-trivial robustness issues. Moreover, we also offer an in-depth analysis to show our method is not specialized for solely attacking the text encoder in T2I DMs.

URL: https://openreview.net/forum?id=8247dUfWvj

---

Title: Improving GFlowNets for Text-to-Image Diffusion Alignment

Abstract: Diffusion models, which are trained to match the distribution of the training dataset, have become the de-facto approach for generating visual data. In addition, we also want to control generation to fulfill desired properties such as alignment to a text description, which can be specified with a black-box reward function.
Prior works fine-tune pretrained diffusion models to achieve this goal through reinforcement learning-based algorithms. Nonetheless, they suffer from issues including slow credit assignment as well as low quality in their generated samples.
In this work, we explore techniques that do not directly maximize the reward but rather generate high-reward images with relatively high probability --- a natural scenario for the framework of generative flow networks (GFlowNets).
To this end, we propose the Diffusion Alignment with GFlowNet (DAG) algorithm to post-train diffusion models with black-box property functions.
Extensive experiments on Stable Diffusion and various reward specifications corroborate that our method can effectively align large-scale text-to-image diffusion models with the given reward information.

URL: https://openreview.net/forum?id=XDbY3qhM42

---

Title: Explaining Node Embeddings

Abstract: Node embedding algorithms produce low-dimensional latent representations of nodes in a graph. These embeddings are often used for downstream tasks, such as node classification and link prediction. In this paper, we investigate the following two questions: (Q1) Can we explain each embedding dimension with human-understandable graph features (e.g., degree, clustering coefficient, and PageRank)? (Q2) How can we modify existing node embedding algorithms to produce embeddings that can be easily explained by human-understandable graph features? We find that the answer to Q1 is yes and introduce a new framework called XM (short for eXplain eMbedding) to answer Q2. A key aspect of XM involves minimizing the nuclear norm of the generated explanations. We show that by minimizing the nuclear norm, we minimize the lower bound on the entropy of the generated explanations. We test XM on a variety of real-world graphs and show that XM not only preserves the performance of existing node embedding methods, but also enhances their explainability.
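
The nuclear-norm regularizer mentioned above is simply the sum of singular values of the explanation matrix; a minimal illustration (the matrix shape here is an assumption about how explanations are stacked):

    import numpy as np

    # Suppose each row explains one embedding dimension in terms of
    # human-understandable graph features (degree, clustering coeff., ...).
    explanations = np.random.randn(64, 8)   # (embedding dims, graph features)

    nuclear_norm = np.linalg.norm(explanations, ord="nuc")  # sum of singular values
    # Equivalent: np.linalg.svd(explanations, compute_uv=False).sum()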

URL: https://openreview.net/forum?id=QQZ8uPxFb3

---

Title: LInK: Learning Joint Representations of Design and Performance Spaces through Contrastive Learning for Mechanism Synthesis

Abstract: In this paper, we introduce LInK, a novel framework that integrates contrastive learning of performance and design space with optimization techniques for solving complex inverse problems in engineering design with discrete and continuous variables.
We focus on the path synthesis problem for planar linkage mechanisms.
By leveraging a multi-modal and transformation-invariant contrastive learning framework, LInK learns a joint representation that captures complex physics and design representations of mechanisms, enabling rapid retrieval from a vast dataset of over 10 million mechanisms.
This approach improves precision through the warm start of a hierarchical unconstrained nonlinear optimization algorithm, combining the robustness of traditional optimization with the speed and adaptability of modern deep learning methods.
Our results on an existing benchmark demonstrate that LInK outperforms existing methods, achieving 28 times lower error than a state-of-the-art approach while taking 20 times less time.
Moreover, we introduce a significantly more challenging benchmark, named LINK-ABC, which involves synthesizing linkages that trace the trajectories of English capital letters—an inverse design benchmark task that existing methods struggle with due to large non-linearities and a tiny feasible space.
Our results demonstrate that LInK not only advances the field of mechanism design but also broadens the applicability of contrastive learning and optimization to other areas of engineering.
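
A hedged sketch of the joint-embedding objective described above (a standard symmetric InfoNCE/CLIP-style loss over paired design and path embeddings; the encoder outputs and names are placeholders, not the paper's exact formulation):

    import torch
    import torch.nn.functional as F

    def symmetric_contrastive_loss(design_emb, path_emb, temperature=0.07):
        """design_emb, path_emb: (batch, dim) embeddings of mechanisms and of
        the curves they trace; matched pairs share a row index."""
        d = F.normalize(design_emb, dim=-1)
        p = F.normalize(path_emb, dim=-1)
        logits = d @ p.T / temperature             # (batch, batch) similarities
        targets = torch.arange(len(d), device=d.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.T, targets))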

URL: https://openreview.net/forum?id=a1MRjOL6WJ

---

Title: Benchmarking General-Purpose In-Context Learning

Abstract: In-context learning (ICL) empowers generative models to address new tasks effectively and efficiently on the fly, without relying on any artificially crafted optimization techniques. In this paper, we study extending ICL to address a broader range of tasks with an extended learning horizon and higher improvement potential, namely General Purpose In-Context Learning (GPICL). To this end, we introduce two lightweight benchmarks specifically crafted to train and evaluate GPICL functionalities. Each benchmark encompasses a vast number of tasks characterized by significant task variance, facilitating meta-training that minimizes inductive bias. These tasks are also crafted to promote long-horizon in-context learning through continuous generation and interaction. These characteristics necessitate the models to leverage contexts and history interactions to enhance their capabilities, across domains such as language modeling, decision-making, and world modeling. Our experiments on the baseline models demonstrate that meta-training with minimal inductive bias and ICL from the ground up is feasible across all the domains we've discussed. Additionally, our findings indicate that the scale of parameters alone may not be crucial for ICL or GPICL, suggesting alternative approaches such as increasing the scale of contexts and memory states.

URL: https://openreview.net/forum?id=yR6YbwJPTU

---

Title: Semantic Alignment for Prompt-Tuning in Vision Language Models

Abstract: Going beyond mere fine-tuning of vision-language models (VLMs), learnable prompt tuning has emerged as a promising, resource-efficient alternative. Despite their potential, effectively learning prompts faces the following challenges: (i) training in a low-shot scenario results in overfitting, limiting adaptability, and yielding weaker performance on newer classes or datasets; (ii) prompt-tuning's efficacy heavily relies on the label space, with decreased performance in large class spaces, signaling potential gaps in bridging image and class concepts. In this work, we investigate whether better text semantics can help address these concerns. In particular, we introduce a prompt-tuning method that leverages class descriptions obtained from Large Language Models (LLMs). These class descriptions are used to bridge image and text modalities. Our approach constructs part-level description-guided image and text features, which are subsequently aligned to learn more generalizable prompts. Our comprehensive experiments conducted across 11 benchmark datasets show that our method outperforms established methods, demonstrating substantial improvements.

URL: https://openreview.net/forum?id=avDr56QjSI

---

Title: Let There be Direction in Hypergraph Neural Networks

Abstract: Hypergraphs are a powerful abstraction for modeling high-order interactions between a set of entities of interest and have been attracting a growing interest in the graph-learning literature. In particular, directed hypergraphs are crucial in their capability of representing real-world phenomena involving group relations where two sets of elements affect one another in an asymmetric way. Despite such a vast potential, an established solution to tackle graph-learning tasks on directed hypergraphs is still lacking. For this reason, in this paper we introduce the Generalized Directed Hypergraph Neural Network (GeDi-HNN), the first spectral-based Hypergraph Neural Network (HNN) capable of seamlessly handling hypergraphs with both directed and undirected hyperedges. GeDi-HNN relies on a graph-convolution operator which is built on top of the Generalized Directed Laplacian $\vec{L}_N$, a novel complex-valued Hermitian matrix which we introduce in this paper. We prove that $\vec{L}_N$ generalizes many previously-proposed Laplacian matrices to directed hypergraphs while enjoying several desirable spectral properties. Extensive computational experiments against state-of-the-art methods on real-world and synthetically-generated datasets demonstrate the efficacy of our proposed HNN. Thanks to effectively leveraging the directional information contained in these datasets, GeDi-HNN achieves a relative-percentage-difference improvement of 7% on average (with a maximum improvement of 23.19%) on the real-world datasets and of 65.3% on average on the synthetic ones.
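
For intuition only, the graph-level (non-hypergraph) example of such a complex Hermitian operator is the magnetic Laplacian: with symmetrized adjacency $A_s = \tfrac{1}{2}(A + A^\top)$, its degree matrix $D_s$, and phase matrix $\Theta = 2\pi q\,(A - A^\top)$ for a charge parameter $q$, one takes

    $$L = D_s - A_s \odot \exp(i\Theta),$$

which is Hermitian and encodes edge direction in the complex phase; the paper's $\vec{L}_N$ plays an analogous role for directed hyperedges.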

URL: https://openreview.net/forum?id=h48Ri6pmvi

---

Title: The Anonymised Product Page Dataset: Web Element Nomination with Graph Neural Networks and Large Language Models

Abstract: Web automation holds the potential to revolutionize how users interact with the digital world, offering unparalleled assistance and simplifying tasks via sophisticated computational methods. Central to this evolution is the web element nomination task, which entails identifying unique elements on webpages. Unfortunately, the development of algorithmic designs for web automation is hampered by the scarcity of comprehensive and realistic datasets that reflect the complexity faced by real-world applications on the Web. To address this, we introduce the Anonymised Product Page Dataset, a comprehensive and diverse collection of webpages that surpasses existing datasets in richness and variety. The dataset features $51,701$ manually labeled product pages from $8,175$ e-commerce websites across eight geographic regions, accompanied by a dataset of rendered page screenshots. To initiate research on the Anonymised Product Page Dataset, we empirically benchmark a range of Graph Neural Networks (GNNs) on the web element nomination task. We make three important contributions. First, we find that a simple Convolutional GNN (GCN) outperforms complex state-of-the-art nomination methods. Second, we introduce a training refinement procedure that involves identifying a small number of relevant elements from each page using the aforementioned GCN. These elements are then passed to a Large Language Model for the final nomination. This procedure significantly improves the nomination accuracy by $16.8$ percentage points on our challenging dataset, without any need for fine-tuning. Finally, in response to another prevalent challenge in this field – the abundance of training methodologies suitable for element nomination – we introduce the \emph{Challenge Nomination Training Procedure}, a training method that further boosts nomination accuracy.

URL: https://openreview.net/forum?id=zz6FesdDbB

---
