Weekly TMLR digest for Jun 25, 2023

TMLR

Jun 24, 2023, 8:00:12 PM
to tmlr-annou...@googlegroups.com

Accepted papers
===============


Title: Training with Mixed-Precision Floating-Point Assignments

Authors: Wonyeol Lee, Rahul Sharma, Alex Aiken

Abstract: When training deep neural networks, keeping all tensors in high precision (e.g., 32-bit or even 16-bit floats) is often wasteful. However, keeping all tensors in low precision (e.g., 8-bit floats) can lead to unacceptable accuracy loss. Hence, it is important to use a precision assignment—a mapping from all tensors (arising in training) to precision levels (high or low)—that keeps most of the tensors in low precision and leads to sufficiently accurate models. We provide a technique that explores this memory-accuracy tradeoff by generating precision assignments for convolutional neural networks that (i) use less memory and (ii) lead to more accurate convolutional networks at the same time, compared to the precision assignments considered by prior work in low-precision floating-point training. We evaluate our technique on image classification tasks by training convolutional networks on CIFAR-10, CIFAR-100, and ImageNet. Our method typically provides > 2× memory reduction over a baseline precision assignment while preserving training accuracy, and gives further reductions by trading off accuracy. Compared to other baselines which sometimes cause training to diverge, our method provides similar or better memory reduction while avoiding divergence.
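A minimal sketch of what a precision assignment looks like in this setting -- a mapping from training tensors to bit widths, with most tensors kept in low precision -- and the memory reduction it implies (toy shapes and a hand-picked assignment for illustration, not the authors' assignment-search algorithm):

import numpy as np

# Hypothetical tensors arising in training, each mapped to a bit width (the "precision assignment")
tensor_shapes = {"conv1.weight": (64, 3, 3, 3), "conv1.act": (128, 64, 32, 32),
                 "fc.weight": (10, 4096), "fc.act": (128, 10)}
assignment = {"conv1.weight": 8, "conv1.act": 8, "fc.weight": 8, "fc.act": 32}

def memory_bytes(shapes, bits_per_tensor):
    return sum(np.prod(shape) * bits_per_tensor[name] / 8 for name, shape in shapes.items())

baseline = memory_bytes(tensor_shapes, {name: 32 for name in tensor_shapes})  # all high precision
mixed = memory_bytes(tensor_shapes, assignment)                               # mostly low precision
print(f"memory reduction: {baseline / mixed:.1f}x")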


URL: https://openreview.net/forum?id=ZoXi7n54OB

---

Title: Bandwidth Enables Generalization in Quantum Kernel Models

Authors: Abdulkadir Canatar, Evan Peters, Cengiz Pehlevan, Stefan M. Wild, Ruslan Shaydulin

Abstract: Quantum computers are known to provide speedups over classical state-of-the-art machine learning methods in some specialized settings. For example, quantum kernel methods have been shown to provide an exponential speedup on a learning version of the discrete logarithm problem. Understanding the generalization of quantum models is essential to realizing similar speedups on problems of practical interest. Recent results demonstrate that generalization is hindered by the exponential size of the quantum feature space. Although these results suggest that quantum models cannot generalize when the number of qubits is large, in this paper we show that these results rely on overly restrictive assumptions. We consider a wider class of models by varying a hyperparameter that we call quantum kernel bandwidth. We analyze the large-qubit limit and provide explicit formulas for the generalization of a quantum model that can be solved in closed form. Specifically, we show that changing the value of the bandwidth can take a model from provably not being able to generalize to any target function to good generalization for well-aligned targets. Our analysis shows how the bandwidth controls the spectrum of the kernel integral operator and thereby the inductive bias of the model. We demonstrate empirically that our theory correctly predicts how varying the bandwidth affects generalization of quantum models on challenging datasets, including those far outside our theoretical assumptions. We discuss the implications of our results for quantum advantage in machine learning.

URL: https://openreview.net/forum?id=A1N2qp4yAq

---

Title: Privacy-Preserving Energy-Based Generative Models for Marginal Distribution Protection

Authors: Robert E. Tillman, Tucker Balch, Manuela Veloso

Abstract: We consider learning generative models for sensitive financial and healthcare data. While previous work incorporates Differential Privacy (DP) into GAN training to protect the privacy of individual training instances, we consider a different privacy context where the primary objective is protecting the privacy of sensitive marginal distributions of the true generative process. We propose and motivate a new notion of privacy: \emph{$\alpha$-Level Marginal Distribution Privacy} ($\alpha$-LMDP), which provides a statistical guarantee that the sensitive generative marginal distributions are different from the observed real data. We then propose \emph{Privacy-Preserving Energy Models (PPEMs)}, a novel energy-based generative model formulation where the representations for these attributes are isolated from other attributes. This structured formulation motivates a learning procedure where a penalty based on a statistical goodness of fit test, the \emph{Kernel Stein Discrepancy}, can be applied to only the attributes requiring privacy so that $\alpha$-LMDP may be satisfied without affecting the other attributes. We evaluate this approach using financial and healthcare datasets and demonstrate that the resulting learnt generative models produce high fidelity synthetic data while preserving privacy. We also show that PPEMs can incorporate both $\alpha$-LMDP \emph{and} DP in contexts where both forms of privacy are required.

URL: https://openreview.net/forum?id=vTsfup5ll6

---

Title: Unsupervised Discovery and Composition of Object Light Fields

Authors: Cameron Omid Smith, Hong-Xing Yu, Sergey Zakharov, Fredo Durand, Joshua B. Tenenbaum, Jiajun Wu, Vincent Sitzmann

Abstract: Neural scene representations, both continuous and discrete, have recently emerged as a powerful new paradigm for 3D scene understanding. Recent efforts have tackled unsupervised discovery of object-centric neural scene representations. However, the high cost of ray-marching, exacerbated by the fact that each object representation has to be ray-marched separately, leads to insufficiently sampled radiance fields and thus, noisy renderings, poor framerates, and high memory and time complexity during training and rendering. Here, we propose to represent objects in an object-centric, compositional scene representation as light fields. We propose a novel light field compositor module that enables reconstructing the global light field from a set of object-centric light fields. Dubbed Compositional Object Light Fields (COLF), our method enables unsupervised learning of object-centric neural scene representations, state-of-the-art reconstruction and novel view synthesis performance on standard datasets, and rendering and training speeds at orders of magnitude faster than existing 3D approaches.

URL: https://openreview.net/forum?id=B7PFZtm8DA

---

Title: A Kernel Perspective on Behavioural Metrics for Markov Decision Processes

Authors: Pablo Samuel Castro, Tyler Kastner, Prakash Panangaden, Mark Rowland

Abstract: We present a novel perspective on behavioural metrics for Markov decision processes via the use of positive definite kernels. We define a new metric under this lens that is provably equivalent to the recently introduced MICo distance (Castro et al., 2021). The kernel perspective enables us to provide new theoretical results, including value-function bounds and low-distortion finite-dimensional Euclidean embeddings, which are crucial when using behavioural metrics for reinforcement learning representations. We complement our theory with strong empirical results that demonstrate the effectiveness of these methods in practice.

URL: https://openreview.net/forum?id=nHfPXl1ly7

---

Title: Pareto Optimization for Active Learning under Out-of-Distribution Data Scenarios

Authors: Xueying Zhan, Zeyu Dai, Qingzhong Wang, Qing Li, Haoyi Xiong, Dejing Dou, Antoni B. Chan

Abstract: Pool-based Active Learning (AL) has proven successful in minimizing labeling costs by sequentially selecting the most informative unlabeled data from a large pool and querying their labels from an oracle or annotators. However, existing AL sampling schemes may not perform well in out-of-distribution (OOD) data scenarios, where the unlabeled data pool contains samples that do not belong to the pre-defined categories of the target task. Achieving strong AL performance under OOD data scenarios presents a challenge due to the inherent conflict between AL sampling strategies and OOD data detection. For instance, both more informative in-distribution (ID) data and OOD data in an unlabeled data pool would be assigned high informativeness scores (e.g., high entropy) during AL processes. To address this dilemma, we propose a Monte-Carlo Pareto Optimization for Active Learning (POAL) sampling scheme, which selects optimal subsets of unlabeled samples with fixed batch size from the unlabeled data pool. We formulate the AL sampling task as a multi-objective optimization problem and employ Pareto optimization based on two conflicting objectives: (1) the conventional AL sampling scheme (e.g., maximum entropy) and (2) the confidence of excluding OOD data samples. Experimental results demonstrate the effectiveness of our POAL approach on classical Machine Learning (ML) and Deep Learning (DL) tasks.
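A minimal sketch of the two-objective selection idea described above: keep only the unlabeled points that are Pareto-optimal (non-dominated) under an informativeness score and an in-distribution confidence score (toy scores and a plain non-dominated filter, not the authors' Monte-Carlo procedure):

import numpy as np

def pareto_front(scores):
    # Indices of non-dominated rows; every objective is to be maximized.
    keep = np.ones(len(scores), dtype=bool)
    for i in range(len(scores)):
        dominated_by = np.all(scores >= scores[i], axis=1) & np.any(scores > scores[i], axis=1)
        keep[i] = not dominated_by.any()
    return np.where(keep)[0]

rng = np.random.default_rng(0)
entropy = rng.random(1000)   # objective 1: conventional AL informativeness (e.g., entropy)
id_conf = rng.random(1000)   # objective 2: confidence that the sample is not OOD
candidates = pareto_front(np.column_stack([entropy, id_conf]))  # pool from which the batch is drawn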

URL: https://openreview.net/forum?id=dXnccpSSYF

---

Title: Predicting Out-of-Domain Generalization with Neighborhood Invariance

Authors: Nathan Hoyen Ng, Neha Hulkund, Kyunghyun Cho, Marzyeh Ghassemi

Abstract: Developing and deploying machine learning models safely depends on the ability to characterize and compare their abilities to generalize to new environments. Although recent work has proposed a variety of methods that can directly predict or theoretically bound the generalization capacity of a model, they rely on strong assumptions such as matching train/test distributions and access to model gradients. In order to characterize generalization when these assumptions are not satisfied, we propose neighborhood invariance, a measure of a classifier’s output invariance in a local transformation neighborhood. Specifically, given an input test point, we sample a set of transformations and calculate the invariance as the largest fraction of transformed points classified into the same class. Crucially, our measure is simple to calculate, does not depend on the test point’s true label, makes no assumptions about the data distribution or model, and can be applied even in out-of-domain (OOD) settings where existing methods cannot, requiring only selecting a set of appropriate data transformations. In experiments on robustness benchmarks in image classification, sentiment analysis, and natural language inference, we demonstrate a strong and robust correlation between our neighborhood invariance measure and actual OOD generalization on over 4,600 models evaluated on over 100 train/test domain pairs.
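A minimal sketch of the measure as described: sample transformations of a test point and report the largest fraction of transformed copies that receive the same predicted class (hypothetical classifier and transformations; note that no true label is needed):

import numpy as np
from collections import Counter

def neighborhood_invariance(classifier, x, transforms, n_samples=50, seed=0):
    rng = np.random.default_rng(seed)
    preds = [classifier(transforms[rng.integers(len(transforms))](x)) for _ in range(n_samples)]
    return Counter(preds).most_common(1)[0][1] / n_samples   # largest same-class fraction

# Toy usage with placeholder ingredients:
classifier = lambda x: int(x.sum() > 0)
transforms = [lambda x: x + 0.05 * np.random.randn(*x.shape), lambda x: x[::-1]]
print(neighborhood_invariance(classifier, np.random.randn(32), transforms))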


URL: https://openreview.net/forum?id=jYkWdJzTwn

---


New submissions
===============


Title: Multi-Domain Long-Tailed Learning by Augmenting Disentangled Representations

Abstract: There is an inescapable long-tailed class-imbalance issue in many real-world classification problems. Current methods for addressing this problem only consider scenarios where all examples come from the same distribution. However, in many cases, there are multiple domains with distinct class imbalance. We study this multi-domain long-tailed learning problem and aim to produce a model that generalizes well across all classes and domains. Towards that goal, we introduce TALLY, a method that addresses this multi-domain long-tailed learning problem. Built upon a proposed selective balanced sampling strategy, TALLY achieves this by mixing the semantic representation of one example with the domain-associated nuisances of another, producing a new representation for use as data augmentation. To improve the disentanglement of semantic representations, TALLY further utilizes a domain-invariant class prototype that averages out domain-specific effects. We evaluate TALLY on several benchmarks and real-world datasets and find that it consistently outperforms other state-of-the-art methods in both subpopulation and domain shift.


URL: https://openreview.net/forum?id=4UXJhNSbwd

---

Title: DreamEdit: Subject-driven Image Editing

Abstract: Subject-driven image generation aims at generating images containing customized subjects, which has recently drawn enormous attention from the research community. Nevertheless, the previous works cannot precisely control the background and position of the target subject. In this work, we aspire to fill the void of the existing subject-driven generation tasks. To this end, we propose two novel subject-driven editing sub-tasks, i.e., Subject Replacement and Subject Addition. The new tasks are challenging in multiple aspects: replacing a subject with a customized one can totally change its shape, texture, and color, while adding a target subject to a designated position in a provided scene necessitates a rational context-aware posture of the subject. To conquer these two novel tasks, we first manually curate a new dataset called DreamEditBench containing 22 different types of subjects, and 440 source images, which cover diverse scenarios with different difficulty levels. We plan to host DreamEditBench as a platform and hire trained evaluators for standardized human evaluation. We also devise an innovative method DreamEditor to resolve these tasks by performing iterative generation, which enables a smooth adaptation to the customized subject. In this project, we conduct automatic and human evaluations to understand the performance of our DreamEditor and baselines on DreamEditBench. We found that the new tasks are challenging for the existing models. For Subject Replacement, we found that the existing models are particularly sensitive to the shape and color of the original subject. When the original subject and the customized subject are highly different, the model failure rate will dramatically increase. For Subject Addition, we found that the existing models cannot easily blend the customized subjects into the background smoothly, which causes noticeable artifacts in the generated image. We hope that DreamEditBench can become a standardized platform to enable future investigations towards building more controllable subject-driven image editing.

URL: https://openreview.net/forum?id=P9haooN9v2

---

Title: Trusting the Untrustworthy: A Cautionary Tale on the Pitfalls of Training-based Rejection Option

Abstract: We consider the problem of selective classification, also known as rejection option. We first analyze state-of-the-art methods that involve a training phase to produce a selective classifier capable of determining when it should abstain from making a decision. Although only some of these frameworks require changes to the basic architecture of the classifier, by adding a module for selection, all methods necessitate implementing modifications to the standard training procedure and loss function for classification. Crucially, we observe two types of limitations affecting these methods: on the one side, these methods exhibit poor performance in terms of selective risk and coverage over some classes, which are not necessarily the hardest to classify; and surprisingly, on the other side, the classes for which they attain low performance vary with the model initialization. Additionally, some of these methods also decrease the accuracy of the final classification. We discuss the limitations of each framework, demonstrating that these shortcomings occur for a wide range of models and datasets. Moreover, we establish a formal connection between the trade-off of detecting misclassification errors and the minimization of risks in selective classification. This connection enables the development of a testing framework that requires no training and can be seamlessly applied to any pre-trained classifier, thereby enabling a rejection option.

URL: https://openreview.net/forum?id=Hod08tayZZ

---

Title: Less greedy equation learning: Balancing interpretability and expressivity through Bayesian model selection

Abstract: In the field of equation learning, exhaustively considering all possible combinations derived from a basis function dictionary is infeasible. Sparse regression and greedy algorithms have emerged as popular approaches to tackle this challenge. However, the presence of strong collinearities poses difficulties for sparse regression techniques, and greedy steps may inadvertently exclude important components of the true equation, leading to reduced identification accuracy. In this article, we present a novel algorithm that strikes a balance between comprehensiveness and efficiency in equation learning. Inspired by stepwise regression, our approach combines the coefficient of determination, $R^2$, and the Bayesian model evidence, $p(y|\mathcal{M})$, in a novel way. Through three extensive numerical experiments involving random polynomials and dynamical systems, we compare our method against two standard approaches, four state-of-the-art methods, and bidirectional stepwise regression incorporating $p(y|\mathcal{M})$. The results demonstrate that our less greedy algorithm surpasses all other methods in terms of identification accuracy. Furthermore, we discover a heuristic approach to mitigate the overfitting penalty associated with $R^2$ and propose an equation learning procedure solely based on $R^2$, which achieves high rates of exact equation recovery.

URL: https://openreview.net/forum?id=0ck7hJ8EVC

---

Title: Multiscale Causal Structure Learning

Abstract: Causal structure learning methods are vital for unveiling causal relationships embedded into observed data. However, the state of the art suffers from a major limitation: it assumes that causal interactions occur only at the frequency at which data is observed. To address this limitation, this paper proposes a method that allows structural learning of linear causal relationships occurring at different time scales. Specifically, we explicitly take into account instantaneous and lagged inter-relations between multiple time series, represented at different scales, hinging on the wavelet transform. We cast the problem as the learning of a multiscale causal graph having sparse structure and dagness constraints, enforcing causality through directed and acyclic topology. To solve the resulting (non-convex) formulation, we propose an algorithm termed MS-CASTLE, which exhibits consistent performance across different noise distributions and wavelet choices. We also propose a single-scale version of our algorithm, SS-CASTLE, which outperforms existing methods in computational efficiency, performance, and robustness on synthetic data. Finally, we apply the proposed approach to learn the multiscale causal structure of the risk of 15 global equity markets during the COVID-19 pandemic, illustrating the importance of multiscale analysis to reveal useful interactions at different time resolutions. Financial investors can leverage our approach to manage risk within equity portfolios from a causal perspective, tailored to their investment horizon.

URL: https://openreview.net/forum?id=Ub6XILEF9x

---

Title: Federated Learning under Partially Disjoint Data via Manifold Reshaping

Abstract: Statistical heterogeneity severely limits the performance of federated learning (FL), motivating several explorations, e.g., FedProx, MOON and FedDyn, to alleviate this problem. Despite their effectiveness, their considered scenario generally requires samples from almost all classes during the local training of each client, although some covariate shifts may exist among clients. In fact, the natural case of partially disjoint data (PDD), where each client contributes a few classes (instead of all classes) of samples, is practical yet underexplored. Specifically, the unique collapse and invasion characteristics of PDD can induce a biased optimization direction in local training, which hampers the efficiency of federated learning. To address this dilemma, we propose a manifold reshaping approach called FedMR to calibrate the feature space of local training. Our FedMR adds two interplaying losses to vanilla federated learning: one is the intra-class loss to decorrelate feature dimensions for anti-collapse; and the other is the inter-class loss to guarantee a proper margin among categories in the feature expansion. We conduct extensive experiments on a range of datasets to demonstrate that our FedMR achieves much higher accuracy and better communication efficiency.

URL: https://openreview.net/forum?id=jLJTqJXAG7

---

Title: Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

Abstract: Recently, there has been significant progress in teaching language models to perform step-by-step reasoning to solve complex numerical reasoning tasks. Chain-of-thoughts prompting (CoT) is the state-of-the-art method for many of these tasks. CoT uses language models to produce text describing reasoning and computation, and finally the answer to a question. Here we propose `Program of Thoughts' (PoT), which uses language models (mainly Codex) to generate text and programming language statements, and finally an answer. In PoT, the computation can be delegated to a program interpreter, which is used to execute the generated program, thus decoupling complex computation from reasoning and language understanding. We evaluate PoT on five math word problem datasets and three financial-QA datasets in both few-shot and zero-shot settings. We find that PoT has an average performance gain over CoT of around 12% across all datasets. By combining PoT with self-consistency decoding, we can achieve extremely strong performance on all the math datasets and financial datasets. All of our data and code will be released.
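A minimal sketch of the division of labor PoT describes: the model emits program text (written by hand here for illustration) and a Python interpreter, not the model, carries out the arithmetic:

# Hypothetical model output for a compound-interest word problem
pot_output = """
principal = 10000
rate = 0.03
years = 5
answer = principal * (1 + rate) ** years
"""

scope = {}
exec(pot_output, scope)     # the computation is delegated to the interpreter
print(scope["answer"])      # 11592.74..., no arithmetic performed by the language model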

URL: https://openreview.net/forum?id=YfZ4ZPt8zd

---

Title: EM-Paste: EM-guided Cut-Paste for Image-level Weakly Supervised Instance Segmentation

Abstract: We propose EM-Paste: an Expectation Maximization (EM) guided Cut-Paste compositional dataset augmentation approach for weakly-supervised instance segmentation using only image-level supervision. The proposed method consists of three main components. The first component generates high-quality foreground object masks. To this end, an EM-like approach is proposed that iteratively refines an initial set of object mask proposals generated by a generic region proposal method. Next, in the second component, high-quality context-aware background images are generated using a text-to-image compositional synthesis method like DALL·E. Finally, the third component creates a large-scale pseudo-labeled instance segmentation training dataset by compositing the foreground object masks onto the original and generated background images. The proposed approach achieves state-of-the-art weakly-supervised instance segmentation results on both the PASCAL VOC 2012 and MS COCO datasets by using only image-level, weak label information. In particular, it outperforms the best baseline by +7.4 and +2.8 mAP0.50 on PASCAL and COCO, respectively. Further, the method provides a new solution to the long-tail weakly-supervised instance segmentation problem (when many classes may only have few training samples), by selectively augmenting under-represented classes.
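A minimal sketch of the final compositing step: pasting a masked foreground object onto a (real or generated) background with Pillow. File names are placeholders, and the foreground image and its mask are assumed to have matching sizes:

from PIL import Image

fg = Image.open("object.png").convert("RGBA")        # extracted foreground object
mask = Image.open("object_mask.png").convert("L")    # its (binary) object mask
bg = Image.open("background.png").convert("RGBA")    # original or generated background

bg.paste(fg, box=(50, 80), mask=mask)                # composite at a chosen position
bg.save("pseudo_labeled_sample.png")                 # becomes a pseudo-labeled training image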

URL: https://openreview.net/forum?id=vpTtyiIBGt

---

Title: Weight-balancing fixes and flows for deep learning

Abstract: Feedforward neural networks with homogeneous activation functions possess a gauge symmetry: the functions they compute do not change when the incoming and outgoing weights at any hidden unit are rescaled by reciprocal positive values. This paper makes two contributions to our understanding of these networks. The first is to describe a simple procedure for gauge-fixing: this procedure computes multiplicative rescaling factors—one at each hidden unit—that rebalance the weights of these networks without changing the end-to-end functions that they compute. Specifically, given an initial network with arbitrary weights, the procedure determines the functionally equivalent network whose weight matrix is of minimal $\ell_{p,q}$-norm; the weights at each hidden unit are said to be balanced when this norm is stationary with respect to rescaling transformations. The optimal rescaling factors are computed in an iterative fashion via simple multiplicative updates, and the updates are notable in that (a) they do not require the tuning of learning rates, (b) they operate in parallel on the rescaling factors at all hidden units, and (c) they converge monotonically to a global minimizer of the $\ell_{p,q}$-norm. The paper's second contribution is to analyze the optimization landscape for learning in these networks. We suppose that the network's loss function consists of two terms—one that is invariant to rescaling transformations, measuring predictive accuracy, and the other (a regularizer) that breaks this invariance, penalizing large weights. We show how to derive a weight-balancing flow such that the regularizer remains minimal with respect to rescaling transformations as the weights descend in the loss function. This weight-balancing flow reduces to an ordinary gradient flow for $\ell_2$-norm regularization, but not otherwise. Though gradient flow serves as a conceptual foundation for deep learning, our analysis suggests a canonical pairing of alternative flows and regularizers.
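A quick numerical check of the rescaling (gauge) symmetry the abstract builds on: scaling a hidden unit's incoming weights and bias by c > 0 and its outgoing weights by 1/c leaves a ReLU network's function unchanged (toy one-hidden-layer network; this illustrates the symmetry, not the paper's balancing updates):

import numpy as np

rng = np.random.default_rng(0)
W1, b1, W2 = rng.standard_normal((16, 8)), rng.standard_normal(16), rng.standard_normal((3, 16))
f = lambda x, W1, b1, W2: W2 @ np.maximum(W1 @ x + b1, 0.0)

c = np.exp(rng.standard_normal(16))                         # one positive factor per hidden unit
W1r, b1r, W2r = c[:, None] * W1, c * b1, W2 / c[None, :]    # rescaled but functionally equivalent

x = rng.standard_normal(8)
assert np.allclose(f(x, W1, b1, W2), f(x, W1r, b1r, W2r))   # identical end-to-end function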

URL: https://openreview.net/forum?id=uaHyXxyp2r

---

Title: An Option-Dependent Analysis of Regret Minimization Algorithms in Finite-Horizon Semi-MDP

Abstract: A large variety of real-world Reinforcement Learning (RL) tasks is characterized by a complex and heterogeneous structure that makes end-to-end (or flat) approaches hardly applicable or even infeasible. Hierarchical Reinforcement Learning (HRL) provides general solutions to address these problems thanks to a convenient multi-level decomposition of the tasks, making their solution accessible. Although often used in practice, few works provide theoretical guarantees to justify this outcome effectively. Thus, it is not yet clear when to prefer such approaches compared to standard flat ones. In this work, we provide an option-dependent upper bound to the regret suffered by regret minimization algorithms in finite-horizon problems. We illustrate that the performance improvement derives from the planning horizon reduction induced by the temporal abstraction enforced by the hierarchical structure. Then, focusing on a sub-setting of HRL approaches, the options framework, we highlight how the average duration of the available options affects the planning horizon and, consequently, the regret itself. Finally, we relax the assumption of having pre-trained options to show how, in particular situations, a hierarchical approach is still preferable to a standard one.

URL: https://openreview.net/forum?id=VP9p4u9jAo

---

Title: A probabilistic Taylor expansion with Gaussian processes

Abstract: We study a class of Gaussian processes for which the posterior mean, for a particular choice of data, replicates a truncated Taylor expansion of any order. The data consist of derivative evaluations at the expansion point and the prior covariance kernel belongs to the class of Taylor kernels, which can be written in a certain power series form. We discuss and prove some results on maximum likelihood estimation of parameters of Taylor kernels. The proposed framework is a special case of Gaussian process regression based on data that is orthogonal in the reproducing kernel Hilbert space of the covariance kernel.


URL: https://openreview.net/forum?id=2TneniEIDB

---

Title: Representations and Computations in Transformers that Support Generalization on Structured Tasks

Abstract: Transformers have shown remarkable success in natural language processing and computer vision, serving as the foundation of large language and multimodal models. These networks can capture nuanced context sensitivity across high-dimensional language tokens or image pixels, but it remains unclear how highly structured behavior and systematic generalization can arise in these systems. Here, we explore the solution process a causal transformer discovers as it learns to solve a set of algorithmic tasks involving copying, sorting, and hierarchical compositions of these operations. We search for the minimal layer and head configuration sufficient to solve these tasks and unpack the roles of the attention heads, as well as how token representations are reweighted across layers to complement these roles. Our results provide new insights into how attention layers in transformers support structured computation within and across tasks: 1) Replacing fixed position labels with labels sampled from a larger set enables strong length generalization. The learnable embeddings of these labels develop different representations, capturing sequence order if necessary, depending on task demand. 2) Two-layer transformers can learn reliable solutions to the multi-level problems we explore. The first layer tends to transform the input representation to allow the second layer to share computation across repeated components within a task or across related tasks. 3) We introduce an analysis pipeline that quantifies how the representation space in a given layer prioritizes different aspects of each item. We show that these representations prioritize information needed to guide attention relative to information that only requires downstream readout.

URL: https://openreview.net/forum?id=oFC2LAqS6Z

---

Title: Finterp: Cost-Time Analysis of Video Action Recognition using the Black Scholes Model

Abstract: We present a novel method to analyze the earliest instant of time at which a pretrained video action recognition neural network is capable of predicting the action class, with high confidence. We exploit the fact that this problem bears similarities with pricing options in a European stock market; consequently, our approach, Finterp, is inspired by the Black Scholes model in finance. We formulate analogies between the conceptualization of the variables involved in the Black Scholes formula and video frames to derive the appropriate algorithm. We use Finterp to extensively analyze the prediction capabilities of the neural network over time, on multiple diverse datasets. Finterp reveals that optimal frames are concentrated at low instants of time for datasets with scene bias and mid instants of time for datasets with motion bias. We demonstrate that Finterp does not compromise on the confidence of action prediction in an attempt to minimize the length of video observed. The 'Black Scholes Accuracy' for state-of-the-art 3D CNNs such as I3D and X3D stands at $81-86\%$, $64\%$ and $25\%$ for Kinetics, UAV Human and Diving-48 respectively, revealing the need to develop neural networks that can learn unique temporal signatures for various actions. Finally, we extend Finterp to make optimal time instant predictions at the hierarchical level, where similar action classes are grouped together, and show that the optimal time instant predictions are at earlier time instants than the corresponding predictions without hierarchy. We will make all code publicly available.

URL: https://openreview.net/forum?id=crab4LLZj0

---

Title: Directional Privacy for Deep Learning

Abstract: Differentially Private Stochastic Gradient Descent (DP-SGD) is a key method for applying privacy in the training of deep learning models. This applies isotropic Gaussian noise to gradients during training, which can perturb these gradients in any direction, damaging utility. Metric DP, however, can provide alternative mechanisms based on arbitrary metrics that might be more suitable for preserving utility. In this paper, we apply \textit{directional privacy}, via a mechanism based on the von Mises-Fisher (VMF) distribution, to perturb gradients in terms of \textit{angular distance} so that gradient direction is broadly preserved. We show that this provides both $\epsilon$-DP and $\epsilon d$-privacy for deep learning training, rather than the $(\epsilon, \delta)$-privacy of the Gaussian mechanism; we observe that the $\epsilon d$-privacy guarantee does not require a $\delta>0$ term but degrades smoothly according to the dissimilarity of the input gradients.

As $\epsilon$s between these different frameworks cannot be directly compared, we examine empirical privacy calibration mechanisms that go beyond previous work on empirically calibrating privacy within standard DP frameworks using membership inference attacks (MIA); we show that a combination of enhanced MIA and reconstruction attacks provides a suitable method for privacy calibration. Experiments on key datasets then indicate that the VMF mechanism can outperform the Gaussian in the utility-privacy trade-off. In particular, our experiments provide a direct comparison of privacy between the two approaches in terms of their ability to defend against reconstruction and membership inference.

URL: https://openreview.net/forum?id=uZcxGCHlSE

---

Title: Population-based Evaluation in Repeated Rock-Paper-Scissors as a Benchmark for Multiagent Reinforcement Learning

Abstract: Progress in fields of machine learning and adversarial planning has benefited significantly from benchmark domains, from checkers and the classic UCI data sets to Go and Diplomacy. In sequential decision-making, agent evaluation has largely been restricted to few interactions against experts, with the aim to reach some desired level of performance (e.g. beating a human professional player). We propose a benchmark for multiagent learning based on repeated play of the simple game Rock, Paper, Scissors along with a population of forty-three tournament entries, some of which are intentionally sub-optimal. We describe metrics to measure the quality of agents based both on average returns and exploitability. We then show that several RL, online learning, and language model approaches can learn good counter-strategies and generalize well, but ultimately lose to the top-performing bots, creating an opportunity for research in multiagent learning.
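A minimal sketch of the exploitability metric in this setting: the best-response payoff an adversary obtains against a fixed mixed strategy over rock, paper, scissors (single-round expected value for illustration; the benchmark itself concerns repeated play against the bot population):

import numpy as np

# Row player's payoff matrix, actions ordered (rock, paper, scissors); note PAYOFF.T == -PAYOFF
PAYOFF = np.array([[ 0., -1.,  1.],
                   [ 1.,  0., -1.],
                   [-1.,  1.,  0.]])

def exploitability(strategy):
    # Because the game is symmetric, PAYOFF @ strategy is the adversary's expected
    # payoff for each pure response; the best response attains the maximum.
    return float((PAYOFF @ strategy).max())

print(exploitability(np.array([1/3, 1/3, 1/3])))   # 0.0: uniform play cannot be exploited
print(exploitability(np.array([0.5, 0.3, 0.2])))   # 0.3: rock-heavy play loses to paper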

URL: https://openreview.net/forum?id=gQnJ7ODIAx

---

Title: Binary Classification under Label Differential Privacy Using Randomized Response Mechanisms

Abstract: Label differential privacy is a popular branch of $\epsilon$-differential privacy for protecting labels in training datasets with non-private features. In this paper, we study the generalization performance of a binary classifier trained on a dataset privatized under the label differential privacy achieved by the randomized response mechanism. Particularly, we establish minimax lower bounds for the excess risks of the deep neural network plug-in classifier, theoretically quantifying how privacy guarantee $\epsilon$ affects its generalization performance. Our theoretical result shows: (1) the randomized response mechanism slows down the convergence of excess risk by lessening the multiplicative constant term compared with the non-private case $(\epsilon=\infty)$; (2) as $\epsilon$ decreases, the optimal structure of the neural network should be smaller for better generalization performance; (3) the convergence of its excess risk is guaranteed even if $\epsilon$ is adaptive to the size of training sample $n$ at a rate slower than $O(n^{-1/2})$. Our theoretical results are validated by extensive simulated examples and two real applications.
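A minimal sketch of the binary randomized response mechanism referred to above: each label is kept with probability e^eps / (1 + e^eps) and flipped otherwise, yielding an eps-label-DP privatized training set (illustrative code, not the paper's):

import numpy as np

def randomized_response(labels, eps, seed=0):
    rng = np.random.default_rng(seed)
    p_keep = np.exp(eps) / (1.0 + np.exp(eps))
    flip = rng.random(labels.shape) >= p_keep
    return np.where(flip, 1 - labels, labels)

y = np.random.randint(0, 2, size=10_000)
y_private = randomized_response(y, eps=1.0)
print((y != y_private).mean())   # ~0.27 of labels are flipped at eps = 1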

URL: https://openreview.net/forum?id=uKCGOw9bGG

---

Title: Training Vision-Language Transformers from Captions

Abstract: Vision-Language Transformers can be learned without low-level human labels (e.g. class labels, bounding boxes, etc). Existing work, whether explicitly utilizing bounding boxes (Chen et al., 2020b; Tan & Bansal, 2019; Lu et al., 2019) or patches (Kim et al., 2021), assumes that the visual backbone must first be trained on ImageNet (Russakovsky et al., 2015) class prediction before being integrated into a multimodal linguistic pipeline. We show that this is not necessary and introduce a new model Vision-Language from Captions (VLC) built on top of Masked Auto-Encoders (He et al., 2022) that does not require this supervision. In fact, in a head-to-head comparison between ViLT, a strong patch-based vision-language transformer which is pretrained with supervised object classification, and our model, VLC, we find that our approach 1. outperforms ViLT on standard benchmarks, 2. provides more interpretable and intuitive patch visualizations, and 3. is competitive with many larger models that utilize ROIs trained on annotated bounding-boxes.

URL: https://openreview.net/forum?id=xLnbSpozWS

---

Title: Benefits of Max Pooling in Neural Networks: Theoretical and Experimental Evidence

Abstract: When deep neural networks became state-of-the-art image classifiers, numerous max pooling operations were an important component of the architecture. However, modern computer vision networks typically have few if any max pooling operations. To understand whether this trend is justified, we develop a mathematical framework analyzing ReLU-based approximations of max pooling, and prove a sense in which max pooling cannot be replicated. We formulate and analyze a class of optimal approximations, and find that the residual can be made exponentially small in the kernel size, but only with an exponentially wide approximation.

This work gives a theoretical basis for understanding the reduced use of max pooling in newer architectures. Since max pooling does not seem necessary, we conclude that, empirically, the inputs on which max pooling is distinct -- those with a large difference between the max and other values -- are not a pattern prevalent in natural images.
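A worked micro-example of the ReLU-based constructions the abstract analyzes: a pairwise max is exactly expressible with one ReLU, max(a, b) = b + relu(a - b); the paper's question is how efficiently this extends to max pooling over larger windows:

import numpy as np

relu = lambda z: np.maximum(z, 0.0)
a, b = np.random.randn(5), np.random.randn(5)
assert np.allclose(np.maximum(a, b), b + relu(a - b))   # exact for a pair of inputs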

URL: https://openreview.net/forum?id=YgeXqrH7gA

---

Title: SkillS: Adaptive Skill Sequencing for Efficient Temporally-Extended Exploration

Abstract: The ability to effectively reuse prior knowledge is a key requirement when building general and flexible Reinforcement Learning (RL) agents. Skill reuse is one of the most common approaches, but current methods have considerable limitations. For example, fine-tuning an existing policy frequently fails, as the policy can degrade rapidly early in training. In a similar vein, distillation of expert behavior can lead to poor results when given sub-optimal experts. We compare several common approaches for skill transfer on multiple domains including changes in task and system dynamics. We identify how existing methods fail and introduce an alternative approach to mitigate these problems. Our approach learns to sequence temporally-extended skills for exploration but learns the final policy directly from the raw experience. This conceptual split enables rapid adaptation and thus efficient data collection but without constraining the final solution. It significantly outperforms many classical methods across a suite of evaluation tasks and we use a broad set of ablations to highlight the importance of different components of our method.

URL: https://openreview.net/forum?id=JwGKVpRfVD

---

Title: Fast Kernel Methods for Generic Lipschitz Losses via $p$-Sparsified Sketches

Abstract: Kernel methods are learning algorithms that enjoy solid theoretical foundations while suffering from important computational limitations. Sketching, which consists in looking for solutions among a subspace of reduced dimension, is a well-studied approach to alleviate these computational burdens. However, statistically-accurate sketches, such as the Gaussian one, usually contain few null entries, such that their application to kernel methods and their non-sparse Gram matrices remains slow in practice. In this paper, we show that sparsified Gaussian (and Rademacher) sketches still produce theoretically-valid approximations while allowing for important time and space savings thanks to an efficient \emph{decomposition trick}. To support our method, we derive excess risk bounds for both single and multiple output kernel problems, with generic Lipschitz losses, hereby providing new guarantees for a wide range of applications, from robust regression to multiple quantile regression. Our theoretical results are complemented with experiments showing the empirical superiority of our approach over state-of-the-art sketching methods.
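A minimal sketch of a sparsified Gaussian sketch matrix in the spirit described above: each entry is zero with probability 1 - p and Gaussian otherwise, rescaled so that sketched norms are preserved in expectation (the exact normalization used in the paper may differ):

import numpy as np

def p_sparsified_gaussian_sketch(s, n, p, rng):
    mask = rng.random((s, n)) < p                                 # keep an entry with probability p
    return mask * rng.standard_normal((s, n)) / np.sqrt(p * s)    # so that E[S.T @ S] = I_n

rng = np.random.default_rng(0)
S = p_sparsified_gaussian_sketch(s=200, n=20_000, p=0.05, rng=rng)
x = rng.standard_normal(20_000)
print((S != 0).mean())                             # ~0.05: most entries are exactly zero
print(np.linalg.norm(S @ x) / np.linalg.norm(x))   # ~1: norms are roughly preserved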

URL: https://openreview.net/forum?id=ry2qgRqTOw

---

Title: Finetuning from Offline Reinforcement Learning: Challenges, Trade-offs and Practical Solutions

Abstract: Offline reinforcement learning (RL) allows for the training of competent agents from offline datasets without any interaction with the environment. Online finetuning of such offline models can further improve performance. But how should we ideally finetune agents obtained from offline RL training? While offline RL algorithms can in principle be used for finetuning, in practice, their online performance improves slowly. In contrast, we show that it is possible to use standard online off-policy algorithms for faster improvement. However, we find this approach may suffer from policy collapse, where the policy undergoes severe performance deterioration during initial online learning. We investigate the issue of policy collapse and how it relates to data diversity, algorithm choices and online replay distribution. Based on these insights, we propose a conservative policy optimization procedure that can achieve stable and sample-efficient online learning from offline pretraining.

URL: https://openreview.net/forum?id=PffrUvAOGy

---

Title: Straggler-Resilient Decentralized Learning via Adaptive Asynchronous Updates

Abstract: With the increasing demand for large-scale training of machine learning models, fully decentralized optimization methods have recently been advocated as alternatives to the popular parameter server framework. In this paradigm, each worker maintains a local estimate of the optimal parameter vector, and iteratively updates it by waiting for and averaging all estimates obtained from its neighbors, and then correcting it on the basis of its local dataset. However, the synchronization phase is sensitive to stragglers. An efficient way to mitigate this effect is to consider asynchronous updates, where each worker computes stochastic gradients and communicates with other workers at its own pace. Unfortunately, fully asynchronous updates suffer from staleness of the stragglers' parameters. To address these limitations, we propose a fully decentralized algorithm DSGD-AAU with adaptive asynchronous updates via adaptively determining the number of neighbor workers for each worker to communicate with. We show that DSGD-AAU achieves a linear speedup for convergence (i.e., convergence performance increases linearly with respect to the number of workers). Experimental results on a suite of datasets and deep neural network models are provided to verify our theoretical results.

URL: https://openreview.net/forum?id=95VmiqPKAX

---

Title: \texttt{FedBC}: Federated Learning Beyond Consensus

Abstract: Federated learning (FL) algorithms, such as FedAvg/FedProx, commonly rely on the consensus constraint, enforcing local models to be equal to the global model obtained through the averaging of local updates. However, in practical FL settings with heterogeneous agents, we question the necessity of enforcing consensus. We empirically observe that relaxing the consensus constraint can improve both local and global performance to a certain extent. To mathematically formulate this, we replace the consensus constraint in the standard FL objective with a proximity requirement between the local and the global model controlled by a tolerance parameter $\gamma$, and propose a novel Federated Learning Beyond Consensus (\texttt{FedBC}) algorithm to solve it. Theoretically, we establish that \texttt{FedBC} converges to a first-order stationary point at rates that match the state of the art, up to an additional error term that depends on the tolerance parameter $\gamma$. Finally, we demonstrate that \texttt{FedBC} balances the global and local model test accuracy metrics across a suite of datasets (Synthetic, MNIST, CIFAR-10, Shakespeare), achieving performance competitive with the state of the art.

URL: https://openreview.net/forum?id=FCgSHlFnYT

---

Title: Overcoming Resource Constraints in Federated Learning: Large Models Can Be Trained with only Weak Clients

Abstract: Federated Learning (FL) is emerging as a popular, promising decentralized learning framework that enables collaborative training among clients, with no need to share private data between them or with a centralized server. However, considering that many edge clients do not have sufficient computing, memory, or communication capabilities, federated learning of large models still faces significant bottlenecks. To keep such weak but crucial clients in the loop, prior works either consider a heterogeneous-client setting where clients train models with different sizes; or offload training to the server. However, the heterogeneous-client setting requires some clients to train the full model, which is not aligned with the resource-constrained setting; while the latter breaks privacy promises in FL when sharing intermediate representations or labels with the server. To overcome these limitations, in this work, we formulate a realistic, but much less explored, cross-device FL setting in which no client can train a full large model nor is willing to share any intermediate information with the remote server. Under such a formulation, we develop a principal sub-model (PriSM) training methodology to collaboratively train a full large model, while assigning each client a small sub-model that is a probabilistic low-rank approximation to the full server model. When creating sub-models, PriSM first performs a principal kernel analysis in the orthogonal kernel space to obtain the importance of each kernel. Then, PriSM adopts a novel importance-aware sampling process to select a subset of kernels (i.e., a kernel with high importance is assigned a higher sampling probability). This sampling process ensures each sub-model is still a low-rank approximation to the full model, while all sub-models together achieve nearly full coverage on the principal kernels. To further improve memory efficiency while still preserving accuracy, PriSM also exploits low-rank structure in intermediate representations and allows each sub-model to learn only a subset of them. Our evaluations on various datasets and models (CNNs, LSTMs, Transformers) under different resource-constrained settings demonstrate that PriSM yields an accuracy improvement of up to $10\%$ compared to existing works. More importantly, PriSM does not incur significant accuracy degradation compared to full-model training (e.g., only $\sim 2\%$ accuracy drops for ResNet-18/CIFAR-10 when clients train only $0.2\times$ sub-models).

URL: https://openreview.net/forum?id=lx1WnkL9fk

---

Title: Rethinking Adversarial Training with A Simple Baseline

Abstract: We report competitive results on RobustBench for CIFAR and SVHN using a simple yet effective baseline approach. Our approach involves a training protocol that integrates rescaled square loss, cyclic learning rates, and erasing-based data augmentation. The outcomes we have achieved are comparable to those of the model trained with state-of-the-art techniques, which is currently the predominant choice for adversarial training. Our baseline, referred to as SimpleAT, yields three novel empirical insights. (i) By switching to square loss, the accuracy is comparable to that obtained by using the de facto training protocol plus data augmentation. (ii) One cyclic learning rate is a good scheduler, which can effectively reduce the risk of robust overfitting. (iii) Employing rescaled square loss during model training can yield a favorable balance between adversarial and natural accuracy. In general, our experimental results show that SimpleAT effectively mitigates robust overfitting and consistently achieves the best performance at the end of training. For example, on CIFAR-10 with ResNet-18, SimpleAT achieves approximately 52% adversarial accuracy against the current strong AutoAttack. Furthermore, SimpleAT exhibits robust performance on various image corruptions, including those commonly found in the CIFAR-10-C dataset. Finally, we assess the effectiveness of these insights through two techniques: bias-variance analysis and logit penalty methods. Our findings demonstrate that all of these simple techniques are capable of reducing the variance of model predictions, which is regarded as the primary contributor to robust overfitting. In addition, our analysis also uncovers connections with various advanced state-of-the-art methods.

URL: https://openreview.net/forum?id=zOT0tu8Yu4

---

Title: Cyclic and Randomized Stepsizes Invoke Heavier Tails in SGD

Abstract: Cyclic and randomized stepsizes are widely used in deep learning practice and can often outperform standard stepsize choices such as a constant stepsize in SGD. Despite their empirical success, not much is currently known about when and why they can theoretically improve the generalization performance. We consider a general class of Markovian stepsizes for learning, which contains i.i.d. random stepsizes, cyclic stepsizes, and constant stepsizes as special cases. Motivated by the literature showing that heaviness of the tails (measured by the so-called "tail-index") in the SGD iterates is correlated with generalization, we study the tail-index and provide a number of theoretical results that demonstrate how the tail-index varies with the stepsize scheduling. Our results bring a new understanding of the benefits of cyclic and randomized stepsizes compared to constant stepsize in terms of the tail behavior. We illustrate our theory on linear regression experiments and show through deep learning experiments that Markovian stepsizes can achieve an even heavier tail and be a viable alternative to cyclic and i.i.d. randomized stepsize rules.

URL: https://openreview.net/forum?id=lNB5EHx8uC

---

Title: Diagnostic Tool for Out-of-Sample Model Evaluation

Abstract: Assessment of model fitness is a key part of machine learning. The standard paradigm of model evaluation is analysis of the average loss over future data. This is often explicit in model fitting, where we select models that minimize the average loss over training data as a surrogate, but comes with limited theoretical guarantees. In this paper, we consider the problem of characterizing a batch of out-of-sample losses of a model using a calibration data set. We provide finite-sample limits on the out-of-sample losses that are statistically valid under quite general conditions and propose a diagnostic tool that is simple to compute and interpret. Several numerical experiments are presented to show how the proposed method quantifies the impact of distribution shifts, aids the analysis of regression, and enables model selection as well as hyperparameter tuning.

URL: https://openreview.net/forum?id=Ulf3QZG9DC

---

Title: Automated Detection of Causal Inference Opportunities: Regression Discontinuity Subgroup Discovery

Abstract: Randomized controlled trials (RCTs) are the gold standard for the identification of causal effects, but they may not always be feasible to conduct. When treatments depend on a threshold, however, such as the blood sugar threshold for diabetes diagnosis, we can still sometimes estimate causal effects with regression discontinuities (RDs). RDs are valid when units just above and below the threshold have the same distribution of covariates and thus no confounding in the presence of noise, establishing an as-if randomization. In practice, however, implementing RD studies can be difficult, as identifying treatment thresholds requires considerable domain expertise; furthermore, the thresholds may differ across subgroups (e.g., the blood sugar threshold for diabetes may differ across demographics), and ignoring these differences can lower statistical power. Finding the thresholds and to whom they apply is an important problem currently solved manually by domain experts, and data-driven approaches are needed when domain expertise is not sufficient. Here, we introduce Regression Discontinuity SubGroup Discovery (RDSGD), a machine-learning method that identifies statistically powerful and interpretable subgroups for RD thresholds. Using a medical claims dataset with over 60 million patients, we apply RDSGD to multiple clinical contexts and identify subgroups with increased compliance to treatment assignment thresholds. As treatment thresholds matter for many diseases and policy decisions, RDSGD can be a powerful tool for discovering new avenues for causal estimation.
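A toy sketch of the regression discontinuity logic the method builds on: units just above and below a treatment threshold are compared as if randomized, within a small bandwidth around the cutoff (synthetic data; the threshold value and variable names are illustrative):

import numpy as np

rng = np.random.default_rng(0)
blood_sugar = rng.uniform(80, 160, size=5_000)          # running variable
treated = blood_sugar >= 126                             # threshold-based treatment rule
outcome = 2.0 * treated + 0.01 * blood_sugar + rng.standard_normal(5_000)

bandwidth = 5.0
near = np.abs(blood_sugar - 126) < bandwidth             # keep only units close to the cutoff
rd_estimate = outcome[near & treated].mean() - outcome[near & ~treated].mean()
print(rd_estimate)   # close to the true effect of 2.0 (small bias from the within-window slope)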

URL: https://openreview.net/forum?id=cdRYoTyHZh

---

Title: Limitation of Characterizing Implicit Regularization by Data-independent Functions

Abstract: In recent years, understanding the implicit regularization of neural networks (NNs) has become a central task in deep learning theory. However, implicit regularization is itself not completely defined and well understood. In this work, we attempt to mathematically define and study implicit regularization. Importantly, we explore the limitations of a common approach to characterizing implicit regularization using data-independent functions. We propose two dynamical mechanisms, i.e., Two-point and One-point Overlapping mechanisms, based on which we provide two recipes for producing classes of one-hidden-neuron NNs that provably cannot be fully characterized by a type of or all data-independent functions. Our results signify the profound data dependency of implicit regularization in general, inspiring us to study in detail the data dependency of NN implicit regularization in the future.

URL: https://openreview.net/forum?id=140kSqm0uy

---

Title: DP-LFlow: Differentially Private Latent Flow for Scalable Sensitive Image Generation

Abstract: Privacy concerns grow with the success of modern deep learning models, especially when the training set contains sensitive data. Differentially private generative models (DPGMs) can serve as a solution to circumvent such concerns by generating data that are distributionally similar to the original data yet with differential privacy (DP) guarantees. While GANs have attracted major attention, existing DPGMs based on flow generative models are limited and only developed on low-dimensional tabular datasets. The capability of exact density estimation makes the flow model exceptional when density estimation is of interest. In this work, we will first show that it is challenging (or even infeasible) to train a DP-flow via DP-SGD, i.e., the workhorse algorithm for private deep learning, on high-dimensional image sets with acceptable utility, and then we give an effective solution by reducing the generation from the pixel space to a lower-dimensional latent space. We show the effectiveness and scalability of the proposed method via extensive experiments, where the proposed method achieves a significantly better privacy-utility trade-off compared to existing alternatives. Notably, our method is the first DPGM to scale to high-resolution image sets (up to 256 × 256).

URL: https://openreview.net/forum?id=GEcneTl9Mk

---

Title: A simple, efficient and scalable contrastive masked autoencoder for learning visual representations

Abstract: Hybrid self-supervised learning methods that combine masked image modelling and contrastive learning have demonstrated state-of-the-art performance across many vision tasks. In this work we identify a property overlooked by previous hybrid methods: they can achieve considerable efficiency improvements compared to contrastive learning, whilst still outperforming the constituent contrastive and masked image modelling training components. To demonstrate this, we introduce CAN, a minimal and conceptually clean synthesis of (C) contrastive learning, (A) masked autoencoders, and (N) the noise prediction approach used in diffusion models. CAN is designed to be efficient, masking 50\% of patches in \emph{both} views, meaning that the overall FLOPs load of SimCLR is 70\% higher than that of CAN for ViT-L backbones. Our combined approach outperforms its MAE and SimCLR constituent parts on an extensive set of downstream transfer learning and robustness tasks under both linear probe and finetune protocols, and with pre-training on large scale datasets such as JFT-300M and ImageNet-21K. Code is provided in the supplementary material, and will be publicly released.


URL: https://openreview.net/forum?id=pjdxPts6er

---

Title: Data pruning and neural scaling laws: fundamental limitations of score-based algorithms

Abstract: Data pruning algorithms are commonly used to reduce the memory and computational cost of the optimization process. Recent empirical results (Guo, B. Zhao, and Bai, 2022) reveal that random data pruning remains a strong baseline and outperforms most existing data pruning methods in the high compression regime, i.e., where a fraction of 30% or less of the data is kept. This regime has recently attracted a lot of interest as a result of the role of data pruning in improving the so-called neural scaling laws; see (Sorscher et al., 2022), where the authors showed the need for high-quality data pruning algorithms in order to beat the sample power law. In this work, we focus on score-based data pruning algorithms and show theoretically and empirically why such algorithms fail in the high compression regime. We demonstrate "No Free Lunch" theorems for data pruning and discuss potential solutions to these limitations.

URL: https://openreview.net/forum?id=iRTL4pDavo

---

Title: Achieving the Pareto Frontier of Regret Minimization and Best Arm Identification in Multi-Armed Bandits

Abstract: We study the Pareto frontier of two archetypal objectives in multi-armed bandits, namely, regret minimization (RM) and best arm identification (BAI) with a fixed horizon. It is folklore that the balance between exploitation and exploration is crucial for both RM and BAI, but exploration is more critical in achieving the optimal performance for the latter objective. To this end, we design and analyze the BoBW-lil’UCB($\gamma$) algorithm. Complementarily, by establishing lower bounds on the regret achievable by any algorithm with a given BAI failure probability, we show that (i) no algorithm can simultaneously perform optimally for both the RM and BAI objectives, and (ii) BoBW-lil’UCB($\gamma$) achieves order-wise optimal performance for RM or BAI under different values of $\gamma$. Our work elucidates the trade-off more precisely by showing how the constants in previous works depend on certain hardness parameters. Finally, we show that BoBW-lil’UCB outperforms a close competitor UCB$_\alpha$ (Degenne et al., 2019) in terms of the time complexity and the regret on diverse datasets such as MovieLens and Published Kinase Inhibitor Set.
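
The abstract does not spell out the index, so the generic sketch below (an ordinary UCB rule with an exploration scale gamma, not the actual BoBW-lil'UCB($\gamma$) index) is only meant to illustrate how a single parameter can trade regret against identification: a larger gamma explores more, which helps identify the best arm but increases regret.

import numpy as np

def run_gamma_ucb(means, horizon, gamma, seed=0):
    """Generic UCB with a gamma-scaled confidence bonus (illustrative only)."""
    rng = np.random.default_rng(seed)
    means = np.asarray(means)
    k = len(means)
    counts, sums = np.zeros(k), np.zeros(k)
    for t in range(horizon):
        if t < k:
            arm = t                                              # pull each arm once
        else:
            bonus = gamma * np.sqrt(2.0 * np.log(t + 1) / counts)
            arm = int(np.argmax(sums / counts + bonus))
        counts[arm] += 1
        sums[arm] += rng.normal(means[arm], 1.0)
    pseudo_regret = float((counts * (means.max() - means)).sum())
    recommended_arm = int(np.argmax(sums / counts))              # BAI recommendation
    return pseudo_regret, recommended_arm

# Larger gamma -> heavier exploration -> more reliable recommendation, higher regret.
print(run_gamma_ucb([0.5, 0.52, 0.4], horizon=5000, gamma=0.5))
print(run_gamma_ucb([0.5, 0.52, 0.4], horizon=5000, gamma=2.0))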

URL: https://openreview.net/forum?id=XXfEmIMJDm

---

Title: Learning Trees of $\ell_0$-Minimization Problems

Abstract: The problem of computing minimally sparse solutions of under-determined linear systems is $NP$-hard in general. Subsets of problems with extra properties may allow efficient algorithms; most notably, problems with the restricted isometry property (RIP) can be solved by convex $\ell_1$-minimization. While these classes have been very successful, they leave out many practical applications.

In this paper, we consider alternative classes of tractable problems. Unlike the RIP, they can be adapted to new situations based on prior knowledge. This knowledge is gained by learning a curriculum that proceeds from easy to hard problems, mimicking the curricula that help human students learn difficult problems in a targeted area of expertise.
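
As a point of reference for the convex relaxation mentioned above (the classical baseline, not the paper's learned approach), the snippet below recovers a sparse solution of an under-determined system via $\ell_1$-minimization (basis pursuit), posed as a linear program; the problem sizes are arbitrary.

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, n, k = 20, 60, 3                          # 20 equations, 60 unknowns, 3 non-zeros
A = rng.normal(size=(m, n))
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.normal(size=k)
b = A @ x_true

# min ||x||_1  s.t.  Ax = b, written as an LP over x = u - v with u, v >= 0
c = np.ones(2 * n)
res = linprog(c, A_eq=np.hstack([A, -A]), b_eq=b, bounds=(0, None))
x_hat = res.x[:n] - res.x[n:]
print(np.allclose(x_hat, x_true, atol=1e-6))  # exact recovery when A behaves RIP-like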

URL: https://openreview.net/forum?id=lOOxqaAMDn

---

Title: CoBA: Causal Contextual Bandits with Active Data Integration

Abstract: We study a contextual bandit setting where the agent has the ability to request multiple data samples – corresponding to potentially different context-action pairs – simultaneously, in one shot, within a budget, along with access to causal side information. This new formalism provides a natural model for several real-world scenarios where parallel targeted experiments can be conducted. We propose a new algorithm that utilizes a novel entropy-like measure that we introduce. We perform multiple experiments, using both purely synthetic data and a real-world dataset, and show that our algorithm outperforms the baselines in all of them. We also study the sensitivity of our algorithm's performance to various aspects of the problem setting. Further, we show that the algorithm is sound; that is, as the budget increases, the learned policy eventually converges to the optimal policy. Finally, we study the fairness implications of our methodology.

URL: https://openreview.net/forum?id=3Kirw8FQrT

---

Title: Gradient Masked Averaging for Federated Learning

Abstract: Federated learning (FL) is an emerging paradigm that permits a large number of clients with heterogeneous data to coordinate learning of a unified global model without the need to share data amongst each other. A major challenge in federated learning is the heterogeneity of data across clients, which can degrade the performance of standard FL algorithms. Standard FL algorithms involve averaging of model parameters or gradient updates to approximate the global model at the server. However, we argue that in heterogeneous settings, averaging can result in information loss and lead to poor generalization due to the bias induced by dominant client gradients. We hypothesize that to generalize better across non-i.i.d. datasets, the algorithms should focus on learning the invariant mechanism that is constant across clients while ignoring spurious mechanisms that differ across clients. Inspired by recent work on out-of-distribution generalization, we propose a gradient masked averaging approach for FL as an alternative to the standard averaging of client updates. This aggregation technique can be adapted as a drop-in replacement in most existing federated algorithms. We perform extensive experiments with multiple FL algorithms on in-distribution, real-world, feature-skewed out-of-distribution, and quantity-imbalanced datasets, and show that it provides consistent improvements, particularly in the case of heterogeneous clients.
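
A rough server-side sketch of this kind of aggregation is shown below; the specific rule (mask coordinates where too few clients agree with the sign of the mean update) is a sign-agreement heuristic in the spirit of OOD "AND-mask" style methods and is an assumption, not necessarily the paper's exact masking criterion. The threshold is a placeholder.

import torch

def gradient_masked_average(client_updates, agreement_threshold=0.8):
    """client_updates: list of 1-D tensors (flattened per-client model deltas)."""
    stacked = torch.stack(client_updates)                     # (num_clients, num_params)
    mean_update = stacked.mean(dim=0)
    # fraction of clients whose coordinate-wise sign matches the sign of the mean
    agree = (torch.sign(stacked) == torch.sign(mean_update)).float().mean(dim=0)
    mask = (agree >= agreement_threshold).float()
    return mask * mean_update                                 # zero out contested coordinates

client_updates = [torch.randn(10) for _ in range(5)]
aggregated = gradient_masked_average(client_updates)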

URL: https://openreview.net/forum?id=REAyrhRYAo

---

Title: On Lower Bounds for the Number of Queries in Clustering Algorithms

Abstract: We consider clustering with the help of an oracle when all queries are made at once and the clusters are determined after all the responses are received. We determine the minimum number of queries required to completely cluster all items. We also consider active clustering with the help of an oracle when a number of queries are made in the first round and the remaining queries are made in the second round based upon the responses to the queries from the first round. We determine a lower bound for the number of queries required to completely cluster all items based upon an analysis of the number of queries made in the first round.

URL: https://openreview.net/forum?id=vtORwCA6ze

---

Title: A Robust Backpropagation-Free Framework for Images

Abstract: While current deep learning algorithms have been successful for a wide variety of artificial intelligence (AI) tasks, including those involving structured image data, they present deep neurophysiological and conceptual issues due to their reliance on gradients computed by backpropagation of errors (backprop). Gradients are required to obtain synaptic weight adjustments, but the backward propagation requires knowledge of the feed-forward activities. Because there is no known biological way for an error (backward) network to be precisely aware of the weights of the original (forward) network, many current deep learning algorithms are largely biologically implausible. This is known as the “weight transport problem”. We present a more biologically plausible approach to the weight transport problem for structured image data by introducing the error-kernel driven activation alignment (EKDAA) algorithm, which trains convolutional neural networks (CNNs) using locally derived error transmission kernels and error maps. Like standard deep learning networks, EKDAA performs the standard forward process via weights and activation functions, but its backward error computation involves learning error kernels that propagate local error signals through the network. We demonstrate the efficacy of EKDAA on the Fashion MNIST, CIFAR-10, and SVHN visual recognition benchmarks, and demonstrate its ability to extract visual features from natural color images. Furthermore, we present results for a CNN trained using a non-differentiable activation function.
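
To illustrate the general idea of avoiding weight transport (not EKDAA itself, whose convolutional error kernels and alignment procedure are more involved), the sketch below defines a linear layer whose backward pass routes the error through a separate kernel instead of the transpose of the forward weights; all names and shapes are placeholders.

import torch

class FeedbackLinear(torch.autograd.Function):
    """Forward uses `weight`; backward routes error through a separate `feedback` kernel."""
    @staticmethod
    def forward(ctx, x, weight, feedback):
        ctx.save_for_backward(x, feedback)
        return x @ weight.t()
    @staticmethod
    def backward(ctx, grad_out):
        x, feedback = ctx.saved_tensors
        grad_x = grad_out @ feedback           # error travels through the separate kernel
        grad_w = grad_out.t() @ x              # local weight update from stored activities
        return grad_x, grad_w, None            # the feedback kernel itself is not updated here

x = torch.randn(8, 16, requires_grad=True)
w = torch.randn(32, 16, requires_grad=True)
fb = torch.randn(32, 16)                       # error kernel; deliberately not w's transpose
out = FeedbackLinear.apply(x, w, fb)
out.sum().backward()                           # no weight transport in the backward pass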

URL: https://openreview.net/forum?id=leqr0vQzeN

---

Title: Bridging the Gap Between Target Networks and Functional Regularization

Abstract: Bootstrapping is behind much of the success of deep reinforcement learning. However, learning the value function via bootstrapping often leads to unstable training due to fast-changing target values. Target Networks are employed to stabilize training by using an additional set of lagging parameters to estimate the target values. Despite the popularity of Target Networks, their effect on the optimization is still poorly understood. In this work, we show that they act as an implicit regularizer which can be beneficial in some cases, but which is inflexible and can cause instabilities, even when vanilla TD(0) converges. To overcome these issues, we propose an explicit Functional Regularization alternative that is a flexible, convex regularizer in function space, and we theoretically study its convergence. We conducted an experimental study across a range of environments, discount factors, and degrees of off-policy data collection to investigate the effectiveness of the regularization induced by Target Networks and Functional Regularization in terms of performance, accuracy, and stability. Our findings emphasize that Functional Regularization can be used as a drop-in replacement for Target Networks and results in performance improvements. Furthermore, adjusting both the regularization weight and the network update period in Functional Regularization can yield further performance improvements compared to solely adjusting the network update period, as is typically done with Target Networks. Our approach also enhances the ability of networks to recover accurate $Q$-values.
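
A rough sketch of the contrast, under assumed network definitions and placeholder hyperparameters (the regularization weight, update period, and loss form are illustrative, not the paper's exact formulation):

import copy
import torch
import torch.nn.functional as F

def td_loss_target_network(q_net, target_net, batch, gamma=0.99):
    """Standard semi-gradient TD target computed with a lagging copy of the network."""
    s, a, r, s_next, done = batch
    with torch.no_grad():
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    return F.mse_loss(q_sa, target)

def td_loss_functional_reg(q_net, prior_net, batch, gamma=0.99, reg_weight=0.1):
    """Bootstrap from the online network, plus an explicit penalty (in function space)
    toward the values of a lagging copy."""
    s, a, r, s_next, done = batch
    with torch.no_grad():
        bootstrap = r + gamma * (1 - done) * q_net(s_next).max(dim=1).values
        prior_values = prior_net(s)
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    return F.mse_loss(q_sa, bootstrap) + reg_weight * F.mse_loss(q_net(s), prior_values)

q_net = torch.nn.Sequential(torch.nn.Linear(4, 32), torch.nn.ReLU(), torch.nn.Linear(32, 2))
prior_net = copy.deepcopy(q_net)   # lagging copy; refreshed every K gradient steps in both schemes
batch = (torch.randn(8, 4), torch.randint(0, 2, (8,)), torch.randn(8), torch.randn(8, 4), torch.zeros(8))
print(td_loss_target_network(q_net, prior_net, batch), td_loss_functional_reg(q_net, prior_net, batch))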

URL: https://openreview.net/forum?id=BFvoemrmqX

---
