Weekly TMLR digest for Oct 15, 2023


TMLR

Oct 14, 2023, 8:00:08 PM
to tmlr-annou...@googlegroups.com


New certifications
==================

Survey Certification: Private GANs, Revisited

Alex Bie, Gautam Kamath, Guojun Zhang

https://openreview.net/forum?id=9sVCIngrhP

---


Accepted papers
===============


Title: Self-supervised Learning for Segmentation and Quantification of Dopamine Neurons in Parkinson’s Disease

Authors: Fatemeh Haghighi, soumitra ghosh, Sarah Chu, Hai Ngu, Mohsen Hejrati, Han Hui Lin, Baris Bingol, Somaye Hashemifar

Abstract: Parkinson’s Disease (PD) is the second most common neurodegenerative disease in humans. PD is characterized by the gradual loss of dopaminergic neurons in the Substantia Nigra (SN, a part of the mid-brain). Counting the number of dopaminergic neurons in the SN is one of the most important indexes in evaluating drug efficacy in PD animal models. Currently, analyzing and quantifying dopaminergic neurons is conducted manually by experts through analysis of digital pathology images, which is laborious, time-consuming, and highly subjective. As such, a reliable and unbiased automated system is needed for the quantification of dopaminergic neurons in digital pathology images. Recent years have seen a surge in adopting deep learning solutions in medical image processing. However, developing high-performing deep learning models hinges on the availability of large-scale, high-quality annotated data, which can be expensive to acquire, especially in applications like digital pathology image analysis. To this end, we propose an end-to-end deep learning framework based on self-supervised learning for the segmentation and quantification of dopaminergic neurons in PD animal models. To the best of our knowledge, this is the first deep learning model that detects the cell body of dopaminergic neurons, counts the number of dopaminergic neurons, and provides characteristics of individual dopaminergic neurons as a numerical output. Extensive experiments demonstrate the effectiveness of our model in quantifying neurons with high precision, which can provide a faster turnaround for drug efficacy studies, a better understanding of dopaminergic neuronal health status, and unbiased results in PD pre-clinical research. As part of our contributions, we also provide the first publicly available dataset of histology digital images along with expert annotations for the segmentation of TH-positive DA neuronal soma.

URL: https://openreview.net/forum?id=izFnURFG3f

---

Title: Dual Cognitive Architecture: Incorporating Biases and Multi-Memory Systems for Lifelong Learning

Authors: Shruthi Gowda, Bahram Zonooz, Elahe Arani

Abstract: Artificial neural networks (ANNs) exhibit a narrow scope of expertise on stationary independent data. However, data in the real world is continuous and dynamic, and ANNs must adapt to novel scenarios while also retaining previously learned knowledge to become lifelong learners. The ability of humans to excel at these tasks can be attributed to multiple factors, ranging from cognitive computational structures and cognitive biases to the multi-memory systems in the brain. We incorporate key concepts from each of these to design a novel framework, Dual Cognitive Architecture (DUCA), which includes multiple sub-systems, an implicit and explicit knowledge representation dichotomy, inductive bias, and a multi-memory system. DUCA shows improvement across different settings and datasets, and it also exhibits reduced task recency bias, without the need for extra information. To further test the versatility of lifelong learning methods under a challenging distribution shift, we introduce a novel domain-incremental dataset, DN4IL. In addition to improving performance on existing benchmarks, DUCA also demonstrates superior performance on this complex dataset.

URL: https://openreview.net/forum?id=PEyVq0hlO3

---

Title: Analysis of Convolutions, Non-linearity and Depth in Graph Neural Networks using Neural Tangent Kernel

Authors: Mahalakshmi Sabanayagam, Pascal Esser, Debarghya Ghoshdastidar

Abstract: The fundamental principle of Graph Neural Networks (GNNs) is to exploit the structural information of the data by aggregating neighboring nodes using a `graph convolution' in conjunction with a suitable choice of network architecture, such as depth and activation functions. Therefore, understanding the influence of each design choice on the network performance is crucial. Convolutions based on the graph Laplacian have emerged as the dominant choice, with symmetric normalization of the adjacency matrix as the most widely adopted one. However, some empirical studies show that row normalization of the adjacency matrix outperforms it in node classification. Despite the widespread use of GNNs, there is no rigorous theoretical study on the representation power of these convolutions that could explain this behavior. Similarly, the empirical observation that linear GNNs perform on par with non-linear ReLU GNNs lacks rigorous theory.

In this work, we theoretically analyze the influence of different aspects of the GNN architecture using the Graph Neural Tangent Kernel in a semi-supervised node classification setting. Under the population Degree Corrected Stochastic Block Model, we prove that: (i) linear networks capture the class information as well as ReLU networks; (ii) row normalization preserves the underlying class structure better than other convolutions; (iii) performance degrades with network depth due to over-smoothing, but the loss in class information is slowest with row normalization; (iv) skip connections retain the class information even at infinite depth, thereby eliminating over-smoothing.
We finally validate our theoretical findings numerically and on real datasets such as Cora and Citeseer.

URL: https://openreview.net/forum?id=xgYgDEof29
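
As a quick reference for the two convolutions being compared, the following NumPy sketch builds both normalizations on a made-up four-node graph (the graph, features, and single linear layer are purely illustrative; the paper's actual analysis is in the Graph Neural Tangent Kernel regime):

# Symmetric normalization  S_sym = D^{-1/2} (A + I) D^{-1/2}
# Row normalization        S_row = D^{-1} (A + I)
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_hat = A + np.eye(len(A))                    # add self-loops
deg = A_hat.sum(axis=1)

S_sym = A_hat / np.sqrt(np.outer(deg, deg))   # D^{-1/2} A_hat D^{-1/2}
S_row = A_hat / deg[:, None]                  # D^{-1} A_hat (each row sums to 1)

X = np.random.randn(4, 8)                     # toy node features
H_sym = S_sym @ X                             # one linear GNN layer, symmetric norm
H_row = S_row @ X                             # one linear GNN layer, row norm
print(S_row.sum(axis=1))                      # row normalization: all ones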

---

Title: Zero-shot Node Classification with Graph Contrastive Embedding Network

Authors: Wei Ju, Yifang Qin, Siyu Yi, Zhengyang Mao, Kangjie Zheng, Luchen Liu, Xiao Luo, Ming Zhang

Abstract: This paper studies zero-shot node classification, which aims to predict new classes (i.e., unseen classes) of nodes in a graph. This problem is challenging yet promising in a variety of real-world applications such as social analysis and bioinformatics. The key to zero-shot node classification is enabling knowledge transfer of nodes from training classes to unseen classes. However, existing methods typically ignore the dependencies between nodes and classes, and fail to integrate the two in a unified way. In this paper, we present a novel framework called the Graph Contrastive Embedding Network (GraphCEN) for zero-shot node classification. Specifically, GraphCEN first constructs an affinity graph to model the relations between the classes. Then node- and class-level contrastive learning (CL) objectives are proposed to jointly learn node embeddings and class assignments in an end-to-end manner. The two levels of CL can be optimized to mutually enhance each other. Extensive experiments indicate that our GraphCEN significantly outperforms state-of-the-art approaches on multiple challenging benchmark datasets.

URL: https://openreview.net/forum?id=8wGXnjRLSy
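
The abstract does not spell out the two contrastive objectives, so the sketch below shows one common way node- and class-level contrastive learning is implemented (instance-wise InfoNCE on embeddings plus InfoNCE on the columns of the class-assignment matrix, as in contrastive clustering). Treat it as a generic illustration rather than GraphCEN itself:

# Generic two-level contrastive loss in PyTorch; sizes and inputs are toy values.
import torch
import torch.nn.functional as F

def info_nce(a, b, tau=0.5):
    """Treat (a[i], b[i]) as positive pairs, everything else as negatives."""
    a, b = F.normalize(a, dim=1), F.normalize(b, dim=1)
    logits = a @ b.t() / tau                         # pairwise similarity matrix
    targets = torch.arange(a.size(0))
    return F.cross_entropy(logits, targets)

n, d, c = 32, 16, 5                                  # nodes, embedding dim, classes
z1, z2 = torch.randn(n, d), torch.randn(n, d)        # node embeddings of two views
p1 = torch.softmax(torch.randn(n, c), 1)             # class assignments of view 1
p2 = torch.softmax(torch.randn(n, c), 1)             # class assignments of view 2

node_loss  = info_nce(z1, z2)                        # node-level agreement
class_loss = info_nce(p1.t(), p2.t())                # columns act as soft class "prototypes"
loss = node_loss + class_loss
print(float(loss))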

---

Title: Sharper Rates and Flexible Framework for Nonconvex SGD with Client and Data Sampling

Authors: Alexander Tyurin, Lukang Sun, Konstantin Pavlovich Burlachenko, Peter Richtárik

Abstract: We revisit the classical problem of finding an approximately stationary point of the average of $n$ smooth and possibly nonconvex functions. The optimal complexity of stochastic first-order methods in terms of the number of gradient evaluations of individual functions is $\mathcal{O}\left(n + n^{1/2}\varepsilon^{-1}\right)$, attained by the optimal SGD methods SPIDER (Fang et al., 2018) and PAGE (Li et al., 2021), for example, where $\varepsilon$ is the error tolerance. However, i) the big-$\mathcal{O}$ notation hides crucial dependencies on the smoothness constants associated with the functions, and ii) the rates and theory in these methods assume simplistic sampling mechanisms that do not offer any flexibility. In this work, we remedy the situation. First, we generalize the PAGE (Li et al., 2021) algorithm so that it can provably work with virtually any (unbiased) sampling mechanism. This is particularly useful in federated learning, as it allows us to construct and better understand the impact of various combinations of client and data sampling strategies. Second, our analysis is sharper as we make explicit use of certain novel inequalities that capture the intricate interplay between the smoothness constants and the sampling procedure. Indeed, our analysis is better even for the simple sampling procedure analyzed in the PAGE (Li et al., 2021) paper. However, this already improved bound can be further sharpened by a different sampling scheme that we propose. In summary, we provide the most general and most accurate analysis of optimal SGD in the smooth nonconvex regime. Finally, our theoretical findings are supported by carefully designed experiments.

URL: https://openreview.net/forum?id=zKgJ6TWAFE
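
For context, here is a bare-bones NumPy sketch of the PAGE-style estimator that the paper generalizes, on a toy least-squares problem with plain uniform minibatch sampling. The paper's contribution is precisely to replace this simplistic sampling with much more general client/data sampling schemes and to analyze them more sharply:

# PAGE-style update: with prob. p recompute the full gradient, otherwise
# correct the previous estimate with a minibatch of gradient differences.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
A, y = rng.standard_normal((n, d)), rng.standard_normal(n)

def grad_i(x, i):                        # gradient of the i-th loss f_i(x)
    return A[i] * (A[i] @ x - y[i])

def full_grad(x):
    return A.T @ (A @ x - y) / n

x = np.zeros(d)
g = full_grad(x)
eta, p, b = 0.01, 0.1, 8
for t in range(500):
    x_new = x - eta * g
    if rng.random() < p:
        g = full_grad(x_new)
    else:
        S = rng.integers(0, n, size=b)   # uniform sampling, for illustration only
        diff = np.mean([grad_i(x_new, i) - grad_i(x, i) for i in S], axis=0)
        g = g + diff
    x = x_new
print(np.linalg.norm(full_grad(x)))      # stationarity measure ||grad f(x)||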

---

Title: Private GANs, Revisited

Authors: Alex Bie, Gautam Kamath, Guojun Zhang

Abstract: We show that the canonical approach for training differentially private GANs -- updating the discriminator with differentially private stochastic gradient descent (DPSGD) -- can yield significantly improved results after modifications to training. Specifically, we propose that existing instantiations of this approach neglect to consider how adding noise only to discriminator updates inhibits discriminator training, disrupting the balance between the generator and discriminator necessary for successful GAN training. We show that a simple fix -- taking more discriminator steps between generator steps -- restores parity between the generator and discriminator and improves results.
Additionally, with the goal of restoring parity, we experiment with other modifications -- namely, large batch sizes and adaptive discriminator update frequency -- to improve discriminator training and see further improvements in generation quality. Our results demonstrate that on standard image synthesis benchmarks, DPSGD outperforms all alternative GAN privatization schemes. Code: https://github.com/alexbie98/dpgan-revisit.

URL: https://openreview.net/forum?id=9sVCIngrhP
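
The proposed fix is mostly a change to the training schedule. The PyTorch sketch below shows the structure (several DP-SGD discriminator steps, with per-example clipping and Gaussian noise, per non-private generator step); the data, network sizes, and hyperparameters are toy values, and a real implementation would use a DP library such as Opacus rather than the microbatch loop used here for clarity:

import torch
import torch.nn as nn

torch.manual_seed(0)
real_data = torch.randn(512, 2) + torch.tensor([3.0, 3.0])     # toy target distribution
G = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 2))
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.SGD(D.parameters(), lr=1e-2)
bce = nn.BCEWithLogitsLoss()
clip, sigma, batch, n_disc_steps = 1.0, 1.0, 64, 5             # n_disc_steps > 1 is the "fix"

def dp_discriminator_step(real_batch):
    fake_batch = G(torch.randn(len(real_batch), 2)).detach()
    x = torch.cat([real_batch, fake_batch])
    y = torch.cat([torch.ones(len(real_batch), 1), torch.zeros(len(fake_batch), 1)])
    grads = [torch.zeros_like(p) for p in D.parameters()]
    for xi, yi in zip(x, y):                                   # per-example clipping
        D.zero_grad()
        bce(D(xi.unsqueeze(0)), yi.unsqueeze(0)).backward()
        norm = torch.sqrt(sum(p.grad.norm() ** 2 for p in D.parameters()))
        scale = torch.clamp(clip / (norm + 1e-12), max=1.0)
        for g_acc, p in zip(grads, D.parameters()):
            g_acc += scale * p.grad
    D.zero_grad()
    for g_acc, p in zip(grads, D.parameters()):                # add Gaussian noise, average
        p.grad = (g_acc + sigma * clip * torch.randn_like(g_acc)) / len(x)
    opt_d.step()

for it in range(100):
    for _ in range(n_disc_steps):                              # more D steps between G steps
        idx = torch.randint(0, len(real_data), (batch,))
        dp_discriminator_step(real_data[idx])
    opt_g.zero_grad()
    fake = G(torch.randn(batch, 2))
    bce(D(fake), torch.ones(batch, 1)).backward()              # non-private generator step
    opt_g.step()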

---

Title: An Analysis of Model-Based Reinforcement Learning From Abstracted Observations

Authors: Rolf A. N. Starre, Marco Loog, Elena Congeduti, Frans A Oliehoek

Abstract: Many methods for Model-Based Reinforcement Learning (MBRL) in Markov decision processes (MDPs) provide guarantees for both the accuracy of the model they can deliver and the learning efficiency. At the same time, state abstraction techniques allow for a reduction of the size of an MDP while maintaining a bounded loss with respect to the original problem. Therefore, it may come as a surprise that no such guarantees are available when combining both techniques, i.e., where MBRL merely observes abstract states. Our theoretical analysis shows that abstraction can introduce a dependence between samples collected online (e.g., in the real world). That means that, without taking this dependence into account, results for MBRL do not directly extend to this setting. Our result shows that we can use concentration inequalities for martingales to overcome this problem. This result makes it possible to extend the guarantees of existing MBRL algorithms to the setting with abstraction. We illustrate this by combining R-MAX, a prototypical MBRL algorithm, with abstraction, thus producing the first performance guarantees for model-based ‘RL from Abstracted Observations’: model-based reinforcement learning with an abstract model.

URL: https://openreview.net/forum?id=YQWOzzSMPp

---

Title: The Kernel Density Integral Transformation

Authors: Calvin McCarter

Abstract: Feature preprocessing continues to play a critical role when applying machine learning and statistical methods to tabular data. In this paper, we propose the use of the kernel density integral transformation as a feature preprocessing step. Our approach subsumes the two leading feature preprocessing methods as limiting cases: linear min-max scaling and quantile transformation. We demonstrate that, without hyperparameter tuning, the kernel density integral transformation can be used as a simple drop-in replacement for either method, offering robustness to the weaknesses of each. Alternatively, with tuning of a single continuous hyperparameter, we frequently outperform both of these methods. Finally, we show that the kernel density transformation can be profitably applied to statistical data analysis, particularly in correlation analysis and univariate clustering.

URL: https://openreview.net/forum?id=6OEcDKZj5j
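
One way to read the transformation described above is as the average kernel CDF evaluated at each feature value. The NumPy sketch below implements that reading (an illustration, not the authors' code): with a small bandwidth it approaches the empirical CDF, i.e. quantile transformation, while with a large bandwidth it behaves like a smooth monotone rescaling in the spirit of min-max scaling.

import numpy as np
from scipy.stats import norm

def kde_integral_transform(x_train, x, bandwidth):
    x_train = np.asarray(x_train, dtype=float)
    z = (np.asarray(x, dtype=float)[:, None] - x_train[None, :]) / bandwidth
    return norm.cdf(z).mean(axis=1)        # average Gaussian-kernel CDF over training points

rng = np.random.default_rng(0)
train = rng.lognormal(size=1000)           # skewed toy feature
x = np.array([0.5, 1.0, 5.0])
print(kde_integral_transform(train, x, bandwidth=0.05))   # ~ empirical CDF (quantile-like)
print(kde_integral_transform(train, x, bandwidth=50.0))   # ~ smooth near-linear rescaling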

---

Title: Overcoming Resource Constraints in Federated Learning: Large Models Can Be Trained with only Weak Clients

Authors: Yue Niu, Saurav Prakash, Souvik Kundu, Sunwoo Lee, Salman Avestimehr

Abstract: Federated Learning (FL) is emerging as a popular, promising decentralized learning framework that enables collaborative training among clients, with no need to share private data between them or with a centralized server. However, considering that many edge clients do not have sufficient computing, memory, or communication capabilities, federated learning of large models still faces significant bottlenecks. To keep such weak but crucial clients in the loop, prior works either consider a heterogeneous-client setting, where clients train models of different sizes, or offload training to the server. However, the heterogeneous-client setting requires some clients to train the full model, which is not aligned with the resource-constrained setting, while the latter approach breaks the privacy promise of FL by sharing intermediate representations or labels with the server. To overcome these limitations, in this work, we formulate a realistic, but much less explored, cross-device FL setting in which no client can train a full large model nor is willing to share any intermediate information with the remote server. Under such a formulation, we develop a principal sub-model (PriSM) training methodology to collaboratively train a full large model, while assigning each client a small sub-model that is a probabilistic low-rank approximation to the full server model. When creating sub-models, PriSM first performs a principal kernel analysis in the orthogonal kernel space to obtain the importance of each kernel. Then, PriSM adopts a novel importance-aware sampling process to select a subset of kernels (i.e., a kernel with high importance is assigned a higher sampling probability). This sampling process ensures each sub-model is still a low-rank approximation to the full model, while all sub-models together achieve nearly full coverage of the principal kernels. To further improve memory efficiency while still preserving accuracy, PriSM also exploits low-rank structure in intermediate representations and allows each sub-model to learn only a subset of them. Our evaluations on various datasets and models (CNNs, LSTMs, Transformers) under different resource-constrained settings demonstrate that PriSM yields an accuracy improvement of up to $10\%$ compared to existing works. More importantly, PriSM does not incur significant accuracy degradation compared to full-model training (e.g., only a $\sim 2\%$ accuracy drop for ResNet-18/CIFAR-10 when clients train only $0.2\times$ sub-models).

URL: https://openreview.net/forum?id=lx1WnkL9fk
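
A rough NumPy sketch of the sampling idea, under the assumption that "kernels" correspond to the orthogonal directions of an SVD and that importance is proportional to the singular values (the exact PriSM procedure may differ):

import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 128))            # a full-model layer (server side)
U, s, Vt = np.linalg.svd(W, full_matrices=False)

keep = 32                                      # sub-model budget (0.25x of 128 kernels)
probs = s / s.sum()                            # importance ~ singular value
idx = rng.choice(len(s), size=keep, replace=False, p=probs)

W_sub = U[:, idx] @ np.diag(s[idx]) @ Vt[idx]  # low-rank sub-model for one client
err = np.linalg.norm(W - W_sub) / np.linalg.norm(W)
print(f"kept {keep}/{len(s)} kernels, relative error {err:.3f}")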

---


New submissions
===============


Title: The Relationship Between the Distribution of Neural Network Weights and Model Accuracy Using Benford’s Law

Abstract: Context: Benford’s Law describes the distribution of leading digits in naturally occurring collections of numbers: when the numbers are divided into nine categories based on their first digit, the largest category consists of numbers that start with 1, followed by those starting with 2, and so on. Objective: Each neuron in a Neural Network (NN) holds a mathematical value, often referred to as a weight, which is updated according to certain parameters. This study explores the Degree of Benford’s Law Existence (DBLE) within Convolutional Neural Networks (CNNs). Additionally, the experiment investigates the correlation between the DBLE and the NN’s accuracy. Methods: A CNN is tested 15 times using various datasets and hyperparameters. The DBLE is calculated for each CNN variation, and the correlation between the CNN’s performance and the DBLE is examined. To further explore the presence of Benford’s Law in CNN models, nine transfer learning models are also tested. Results: The experiment suggests that: 1) Benford’s Law is observed in the weights of neural networks, and in most cases the DBLE increases as training progresses; 2) models with superior performance also tend to exhibit a higher DBLE.

URL: https://openreview.net/forum?id=Mjf07WLztZ
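
For concreteness, the following NumPy snippet computes the kind of quantity the abstract describes: the first-digit distribution of a set of weights and its closeness to Benford's law. The closeness score used here (one minus total variation distance) is only a stand-in for the paper's DBLE, whose exact definition is in the submission:

import numpy as np

def first_digits(w):
    w = np.abs(np.asarray(w, dtype=float))
    w = w[w > 0]
    exp = np.floor(np.log10(w))
    return (w / 10.0 ** exp).astype(int)             # leading significant digit, 1..9

def benford_closeness(w):
    digits = first_digits(w)
    emp = np.bincount(digits, minlength=10)[1:10] / len(digits)
    benford = np.log10(1.0 + 1.0 / np.arange(1, 10)) # P(first digit = d) = log10(1 + 1/d)
    return 1.0 - 0.5 * np.abs(emp - benford).sum()   # 1 = perfect match with Benford

weights = np.random.default_rng(0).standard_normal(100_000) * 0.05
print(benford_closeness(weights))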

---

Title: Investigating the Nature of 3D Generalization in Deep Neural Networks

Abstract: Visual object recognition systems need to generalize from a set of 2D training views to novel views. The question of how the human visual system can generalize to novel views has been studied and modeled in psychology, computer vision, and neuroscience. Modern deep learning architectures for object recognition generalize well to novel views, but the mechanisms are not well understood. In this paper, we characterize the ability of common deep learning architectures to generalize to novel views. We formulate this as a supervised classification task where labels correspond to unique 3D objects and examples correspond to 2D views of the objects at different 3D orientations. We consider three common models of generalization to novel views: (i) full 3D generalization, (ii) pure 2D matching, and (iii) matching based on a linear combination of views. We find that deep models generalize well to novel views, but they do so in a way that differs from all these existing models. Extrapolation to views beyond the range covered by views in the training set is limited, and extrapolation to novel rotation axes is even more limited, implying that the networks do not infer full 3D structure, nor use linear interpolation. Yet, generalization is far superior to pure 2D matching. These findings help with designing datasets with 2D views required to achieve 3D generalization.

URL: https://openreview.net/forum?id=yzi2liOfpc

---

Title: A Unified View on Solving Objective Mismatch in Model-Based Reinforcement Learning

Abstract: Model-based Reinforcement Learning (MBRL) aims to make agents more sample-efficient, adaptive, and explainable by learning an explicit model of the environment. While the capabilities of MBRL agents have significantly improved in recent years, how to best learn the model is still an unresolved question. The majority of MBRL algorithms aim at training the model to make accurate predictions about the environment and subsequently using the model to determine the most rewarding actions. However, recent research has shown that model predictive accuracy is often not correlated with action quality, tracing the root cause to the objective mismatch between accurate dynamics model learning and policy optimization of rewards. A number of interrelated solution categories to the objective mismatch problem have emerged as MBRL continues to mature as a research area. In this work, we provide an in-depth survey of these solution categories and propose a taxonomy to foster future research.

URL: https://openreview.net/forum?id=tQVZgvXhZb

---

Title: Learning Personalized Decision Support Policies

Abstract: Individual human decision-makers may benefit from different forms of support to improve decision outcomes, but \textit{which} form of support will yield better outcomes? In this work, we propose the general problem of learning a \textit{decision support policy} that, for a given input, chooses which form of support to provide to decision-makers for whom we initially have no prior information. Using techniques from stochastic contextual bandits, we introduce \texttt{THREAD}, an online algorithm to personalize a decision support policy for each decision-maker. We further propose a variant of \texttt{THREAD} for the multi-objective setting to account for auxiliary objectives like the cost of support. We find that \texttt{THREAD} can learn a personalized policy that outperforms offline policies, and, in the cost-aware setting, reduce the incurred cost with minimal degradation to performance. Our experiments include various realistic forms of support (e.g., expert consensus and predictions from a large language model) on vision and language tasks. We deploy \texttt{THREAD} with real users to show how personalized policies can be learned online and illustrate nuances of learning decision support policies in practice.

URL: https://openreview.net/forum?id=2bLx0ZdqLs
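
To make the problem formulation concrete, here is a toy contextual-bandit loop in the spirit of the setup above: arms are forms of support, contexts are features of the current input, and a LinUCB-style policy picks the support per instance. This is a generic sketch with simulated rewards, not the THREAD algorithm itself:

import numpy as np

rng = np.random.default_rng(0)
d, n_arms, alpha = 5, 3, 1.0                      # arms: e.g. no support / expert consensus / LLM prediction
A = [np.eye(d) for _ in range(n_arms)]            # per-arm ridge statistics
b = [np.zeros(d) for _ in range(n_arms)]
true_theta = rng.standard_normal((n_arms, d))     # unknown per-arm utility (simulation only)

hits = 0
for t in range(2000):
    x = rng.standard_normal(d)                    # context for this decision instance
    ucb = []
    for a in range(n_arms):
        A_inv = np.linalg.inv(A[a])
        theta = A_inv @ b[a]
        ucb.append(theta @ x + alpha * np.sqrt(x @ A_inv @ x))
    arm = int(np.argmax(ucb))                     # chosen form of support
    reward = true_theta[arm] @ x + 0.1 * rng.standard_normal()
    A[arm] += np.outer(x, x)
    b[arm] += reward * x
    if t >= 1800:
        hits += int(arm == int(np.argmax(true_theta @ x)))
print("best-support hit rate over last 200 rounds:", hits / 200)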

---

Title: Unleashing the Potential of Acquisition Functions in High-Dimensional Bayesian Optimization

Abstract: Bayesian optimization (BO) is widely used to optimize expensive-to-evaluate black-box functions. It first builds a surrogate for the objective and quantifies its uncertainty. It then decides where to sample by maximizing an acquisition function (AF) defined by the surrogate model. However, when dealing with high-dimensional problems, finding the global maximum of the AF becomes increasingly challenging. In such cases, the manner in which the AF maximizer is initialized plays a pivotal role. An inappropriate initialization can severely limit the potential of the AF.

This paper investigates a largely understudied problem concerning the impact of AF maximizer initialization on exploiting AFs' capability. Our large-scale empirical study shows that the widely used random initialization strategy may fail to harness the potential of an AF. Based on this observation, we propose a better initialization approach by employing multiple heuristic optimizers to leverage the historical data of black-box optimization to generate initial points for an AF maximizer. We evaluate our approach with a variety of heavily studied synthetic test functions and real-world applications. Experimental results show that our techniques, while simple, can significantly enhance the standard BO and outperform state-of-the-art methods by a large margin in most test cases.

URL: https://openreview.net/forum?id=0CM7Hfsy61
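
A toy illustration of why AF-maximizer initialization matters in high dimensions: starting points drawn uniformly at random versus starting points derived from the best historical evaluations. The "heuristic optimizer" here is just Gaussian perturbation of top past points, and the acquisition function is replaced by a toy objective, so this is only a caricature of the proposed approach:

import numpy as np

rng = np.random.default_rng(0)
dim, n_hist, n_starts = 50, 200, 16
X_hist = rng.uniform(-1, 1, size=(n_hist, dim))        # past query points
y_hist = -np.sum(X_hist ** 2, axis=1)                  # toy objective values (to maximize)

def acq(X):                                            # toy stand-in for an acquisition function
    return -np.sum(X ** 2, axis=1)

def random_init():
    return rng.uniform(-1, 1, size=(n_starts, dim))

def history_based_init():
    top = X_hist[np.argsort(y_hist)[-n_starts:]]        # best evaluated points so far
    return np.clip(top + 0.05 * rng.standard_normal(top.shape), -1, 1)

print("random init, best AF start value:  ", acq(random_init()).max())
print("history init, best AF start value: ", acq(history_based_init()).max())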

---

Title: Learning from Natural Language Feedback

Abstract: The potential for pre-trained large language models (LLMs) to use natural language feedback at inference time has been an exciting recent development. We build upon this observation by formalizing an algorithm for learning from natural language feedback at training time instead, which we call Imitation learning from Language Feedback (ILF). ILF requires only a small amount of human-written feedback during training and does not require the same feedback at test time, making it both user-friendly and sample-efficient. We further show that ILF can be seen as a form of minimizing the KL divergence to the target distribution and demonstrate proof-of-concepts on text summarization and program synthesis tasks. For code generation, ILF improves a Codegen-Mono 6.1B model's pass@1 rate by 38% relative (and 10% absolute) on the Mostly Basic Python Problems (MBPP) benchmark, outperforming both fine-tuning on MBPP and fine-tuning on repaired programs written by humans. For summarization, we show that ILF can be combined with learning from human preferences to improve a GPT-3 model's summarization performance to be comparable to human quality, outperforming fine-tuning on human-written summaries. Overall, our results suggest that learning from human-written natural language feedback is both more effective and sample-efficient than training exclusively on demonstrations for improving an LLM's performance on a variety of tasks.

URL: https://openreview.net/forum?id=xo3hI5MwvU

---

Title: Group Robustness via Discounted Rank Upweighting

Abstract: Recent work has shown that standard training via empirical risk minimization (ERM) can produce models that achieve high accuracy on average but low accuracy on underrepresented groups due to the prevalence of spurious features. A predominant approach to tackle this group robustness problem minimizes the worst group error (akin to a {\it minimax} strategy) on the training data with the expectation that it will generalize well to unseen test data. However, this is often suboptimal, especially when the out-of-distribution (OOD) test data contains previously unseen groups. Inspired by ideas from the information retrieval and learning-to-rank literature, this paper first proposes to use Discounted Cumulative Gain (DCG) as a metric of model quality for facilitating better hyperparameter tuning and model selection. Being a ranking-based metric, DCG weights multiple poorly-performing groups (instead of considering just the group with the worst performance). As a natural next step, we build on our results to propose a ranking-based training method called \textbf{Discounted Rank Upweighting (DRU)} which differentially reweights a ranked list of poorly-performing groups in the training data to learn models that exhibit strong OOD performance on the test data. Results on several synthetic and real-world datasets highlight the superior generalization ability of our group-ranking-based (akin to {\it soft-minimax}) approach in selecting and learning models that are robust to group distributional shifts.

URL: https://openreview.net/forum?id=AFgntSak7H
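
The following NumPy snippet shows one plausible reading of "discounted rank upweighting": sort groups by loss, worst first, and weight them with DCG-style logarithmic discounts so that several poorly-performing groups contribute instead of only the single worst one. The exact weights used by DRU may differ:

import numpy as np

group_losses = np.array([0.10, 0.45, 0.30, 0.80, 0.25])      # toy per-group losses

order = np.argsort(-group_losses)                            # worst group ranked first
discounts = 1.0 / np.log2(np.arange(len(group_losses)) + 2)  # 1/log2(rank+1), rank starting at 1
weights = np.empty_like(group_losses)
weights[order] = discounts / discounts.sum()                 # largest weight on the worst group

worst_group_objective = group_losses.max()                   # minimax objective
dru_style_objective = float(weights @ group_losses)          # discounted-rank ("soft-minimax") objective
print(weights.round(3), worst_group_objective, round(dru_style_objective, 3))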

---

Title: High-dimensional Bayesian Optimization via Covariance Matrix Adaptation Strategy

Abstract: Bayesian Optimization (BO) is an effective method for finding the global optimum of expensive black-box functions. However, it is well known that applying BO to high-dimensional optimization problems is challenging. To address this issue, a promising solution is to use a local search strategy that partitions the search domain into local regions with high likelihood of containing the global optimum, and then use BO to optimize the objective function within these regions. In this paper, we propose a novel technique for defining the local regions using the Covariance Matrix Adaptation (CMA) strategy. Specifically, we use CMA to learn a search distribution that can estimate the probabilities of data points being the global optimum of the objective function. Based on this search distribution, we then define the local regions consisting of data points with high probabilities of being the global optimum. Our approach serves as a meta-algorithm as it can incorporate existing black-box BO optimizers, such as BO, TuRBO, and BAxUS, to find the global optimum of the objective function within our derived local regions. We evaluate our proposed method on various benchmark synthetic and real-world problems. The results demonstrate that our method outperforms existing state-of-the-art techniques.

URL: https://openreview.net/forum?id=eTgxr7gPuU

---

Title: Distributional GFlowNets with Quantile Flows

Abstract: Generative Flow Networks (GFlowNets) are a new family of probabilistic samplers where an agent learns a stochastic policy for generating complex combinatorial structures through a series of decision-making steps.
There have been recent successes in applying GFlowNets to a number of practical domains where diversity of the solutions is crucial, while reinforcement learning aims to learn an optimal solution based on the given reward function only and fails to discover diverse and high-quality solutions.
However, the current GFlowNet framework is relatively limited in its applicability and cannot handle stochasticity in the reward function.
In this work, we adopt a distributional paradigm for GFlowNets, turning each flow function into a distribution, thus providing more informative learning signals during training.
By parameterizing each edge flow through its quantile function, our proposed \textit{quantile matching} GFlowNet learning algorithm is able to learn a risk-sensitive policy, an essential component for handling scenarios with risk uncertainty.
Moreover, we find that the distributional approach can achieve substantial improvement on existing benchmarks compared to prior methods due to our enhanced training algorithm, even in settings with deterministic rewards.

URL: https://openreview.net/forum?id=vFSsRYGpjW

---

Title: In search of projectively equivariant networks

Abstract: Equivariance of linear neural network layers is well studied.
In this work, we relax the equivariance condition to only be true in a projective sense.
In doing so, we introduce the topic of projective equivariance to the machine learning audience.
We theoretically study the relation of projectively and linearly equivariant linear layers. We find that in some important cases, surprisingly, the two types of layers coincide.
We also propose a way to construct a projectively equivariant neural network, which boils down to building a standard equivariant network where the linear group representations acting on each intermediate feature space are lifts of projective group representations.
Projective equivariance is showcased in two simple experiments. Code for the experiments is provided in the supplementary material.

URL: https://openreview.net/forum?id=Ls1E16bTj8
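
For readers new to the term, projective equivariance relaxes ordinary equivariance by a scalar factor. With generic notation (chosen here for illustration; the paper's precise setup may differ): a linear map $\Phi$ is equivariant with respect to group representations $\rho$ and $\rho'$ if $\Phi(\rho(g)x) = \rho'(g)\Phi(x)$ for all $g$, and projectively equivariant if this only needs to hold up to a nonzero scalar, i.e., $\Phi(\rho(g)x) = c(g)\,\rho'(g)\Phi(x)$.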

---

Title: Discovery of Hierarchy in Embedding Space

Abstract: Existing learning models partition the generated representations using linear hyperplanes, which form well-defined groups of similar embeddings that can be uniquely mapped to a particular class. However, in practical applications, the embedding space does not form distinct boundaries that segregate the clusters. Moreover, the structure of the latent space remains obscure. As learned representations are frequently reused to reduce inference time, it is important to analyse how semantically related classes interact in the latent space. We propose a cluster-growing algorithm that minimises the inclusion of other classes in the embedding space to form clusters of similar representations. These clusters overlap to denote ambiguous embeddings that cannot be mapped to a particular class with high confidence. We then construct relation trees and evaluate our method against the WordNet hierarchy using phylogenetic tree comparison methods.

URL: https://openreview.net/forum?id=LXNFjUylOl

---

Title: Optical Transformers

Abstract: The rapidly increasing size of deep-learning models has renewed interest in alternatives to digital-electronic computers as a means to dramatically reduce the energy cost of running state-of-the-art neural networks. Optical matrix-vector multipliers are best suited to performing computations with very large operands, which suggests that large Transformer models could be a good target for them. In this paper, we investigate---through a combination of simulations and experiments on prototype optical hardware---the feasibility and potential energy benefits of running Transformer models on future optical accelerators that perform matrix-vector multiplication.

We use simulations, with noise models validated by small-scale optical experiments, to show that optical accelerators for matrix-vector multiplication should be able to accurately run a typical Transformer architecture model for language processing. We demonstrate that optical accelerators can achieve the same (or better) perplexity as digital-electronic processors at 8-bit precision, provided that the optical hardware uses sufficiently many photons per inference, which translates directly to a requirement on optical energy per inference. We studied numerically how the requirement on optical energy per inference changes as a function of the Transformer width $d$ and found that the optical energy per multiply--accumulate (MAC) scales approximately as $\frac{1}{d}$, giving an asymptotic advantage over digital systems.

We also analyze the total system energy costs for optical accelerators running Transformers, including both optical and electronic costs, as a function of model size. We predict that well-engineered, large-scale optical hardware should be able to achieve a $100 \times$ energy-efficiency advantage over current digital-electronic processors in running some of the largest current Transformer models, and if both the models and the optical hardware are scaled to the quadrillion-parameter regime, optical accelerators could have a $>8,000\times$ energy-efficiency advantage. Under plausible assumptions about future improvements to electronics and Transformer quantization techniques (5× cheaper memory access, double the digital--analog conversion efficiency, and 4-bit precision), we estimate that the energy advantage for optical processors versus electronic processors operating at 300~fJ/MAC could grow to $>100,000\times$.

URL: https://openreview.net/forum?id=Xxw0edFFQC

---

Title: Control, Transport and Sampling: The Benefit of a Reference Process

Abstract: We aim to establish connections between diffusion-based sampling, optimal transport, and optimal (stochastic) control through their shared links to the Schroedinger bridge problem. Throughout, we highlight the importance of having a reference measure on the path space for the design of a valid objective function that can be used to transport $\nu$ to $\mu$, consequently sample from the target $\mu$, via (optimally) controlled dynamics.

URL: https://openreview.net/forum?id=MhQCbsxOcw

---

Title: How does over-squashing affect the power of GNNs?

Abstract: Graph Neural Networks (GNNs) are the state-of-the-art model for machine learning on graph-structured data. The most popular class of GNNs operate by exchanging information between adjacent nodes, and are known as Message Passing Neural Networks (MPNNs). While understanding the expressive power of MPNNs is a key question, existing results typically consider settings with uninformative node features. In this paper, we provide a rigorous analysis to determine which function classes of node features can be learned by an MPNN of a given capacity. We do so by measuring the level of *pairwise interactions* between nodes that MPNNs allow for. This measure provides a novel quantitative characterization of the so-called over-squashing effect, which is observed to occur when a large volume of messages is aggregated into fixed-size vectors. Using our measure, we prove that, to guarantee sufficient communication between pairs of nodes, the capacity of the MPNN must be large enough, depending on properties of the input graph structure, such as commute times. For many relevant scenarios, our analysis results in impossibility statements in practice, showing that *over-squashing hinders the expressive power of MPNNs*. Our theory also holds for geometric graphs and hence extends to equivariant MPNNs on point clouds. We validate our analysis through extensive controlled experiments and ablation studies.

URL: https://openreview.net/forum?id=KJRoQvRWNs

---

Title: Continual HyperTransformer: A Meta-Learner for Continual Few-Shot Learning

Abstract: We focus on the problem of learning without forgetting from multiple tasks arriving sequentially, where each task is defined using a few-shot episode of novel or already seen classes. We approach this problem using the recently published HyperTransformer (HT), a Transformer-based hypernetwork that generates specialized task-specific CNN weights directly from the support set. In order to learn from a continual sequence of tasks, we propose to recursively re-use the generated weights as input to the HT for the next task. This way, the generated CNN weights themselves act as a representation of previously learned tasks, and the HT is trained to update these weights so that the new task can be learned without forgetting past tasks. This approach is different from most continual learning algorithms that typically rely on using replay buffers, weight regularization or task-dependent architectural changes. We demonstrate that our proposed Continual HyperTransformer method equipped with a prototypical loss is capable of learning and retaining knowledge about past tasks for a variety of scenarios, including learning from mini-batches, and task-incremental and class-incremental learning scenarios.

URL: https://openreview.net/forum?id=zdtSqZnkx1

---

Title: Normalization Is All You Need: Understanding Layer-Normalized Federated Learning under Extreme Label Shift

Abstract: Layer normalization (LN) is a widely adopted deep learning technique, especially in the era of foundation models. Recently, LN has been shown to be surprisingly effective in federated learning (FL) with non-i.i.d. data. However, exactly why and how it works remains mysterious. In this work, we reveal the profound connection between layer normalization and the label shift problem in federated learning. To understand layer normalization better in FL, we identify the key contributing mechanism of normalization methods in FL, called feature normalization (FN), which applies normalization to the latent feature representation before the classifier head. Although LN and FN do not improve expressive power, they control feature collapse and local overfitting to heavily skewed datasets, and thus accelerate global training. Empirically, we show that normalization leads to drastic improvements on standard benchmarks under extreme label shift. Moreover, we conduct extensive ablation studies to understand the critical factors of layer normalization in FL. Our results verify that FN is an essential ingredient inside LN to significantly improve the convergence of FL while remaining robust to learning rate choices, especially under extreme label shift where each client has access to only a few classes.

URL: https://openreview.net/forum?id=6BDHUkSPna
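
A minimal PyTorch sketch of where the feature normalization (FN) described above sits in a model: the latent representation is normalized right before the classifier head. L2 normalization is used here as a stand-in; whether the paper's FN is exactly this or a layer-norm-style operation is left to the paper:

import torch
import torch.nn as nn
import torch.nn.functional as F

class FNClassifier(nn.Module):
    def __init__(self, in_dim=32, hidden=64, n_classes=10):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                      nn.Linear(hidden, hidden), nn.ReLU())
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):
        h = self.backbone(x)
        h = F.normalize(h, dim=1)          # FN: normalize features before the classifier head
        return self.head(h)

model = FNClassifier()
logits = model(torch.randn(8, 32))
print(logits.shape)                        # torch.Size([8, 10])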

---

Title: Imprecise Bayesian Neural Networks

Abstract: Uncertainty quantification and robustness to distribution shifts are important goals in machine learning and artificial intelligence. Although Bayesian Neural Networks (BNNs) allow uncertainty in the predictions to be assessed, different sources of uncertainty are indistinguishable. We present Imprecise Bayesian Neural Networks (IBNNs), which generalize and overcome some of the drawbacks of standard BNNs. The latter are trained using a single prior and a single likelihood distribution, whereas IBNNs are trained using credal prior and likelihood sets. IBNNs make it possible to distinguish between aleatoric and epistemic uncertainty and to quantify both. In addition, IBNNs are more robust than BNNs to prior and likelihood misspecification and to distribution shift. They can also be used to compute sets of outcomes that enjoy probabilistic guarantees. We apply IBNNs to two case studies: one on motion prediction in autonomous driving scenarios, and one on modeling blood glucose and insulin dynamics for artificial pancreas control. We show that IBNNs perform better than an ensemble-of-BNNs benchmark.

URL: https://openreview.net/forum?id=bolsjmDleF

---
