Weekly TMLR digest for Jul 24, 2022

TMLR

23/07/2022, 20:00:09
to tmlr-annou...@googlegroups.com

Accepted papers
===============


Title: Deep Classifiers with Label Noise Modeling and Distance Awareness

Authors: Vincent Fortuin, Mark Collier, Florian Wenzel, James Urquhart Allingham, Jeremiah Zhe Liu, Dustin Tran, Balaji Lakshminarayanan, Jesse Berent, Rodolphe Jenatton, Effrosyni Kokiopoulou

Abstract: Uncertainty estimation in deep learning has recently emerged as a crucial area of interest to advance reliability and robustness in safety-critical applications. While there have been many proposed methods that either focus on distance-aware model uncertainties for out-of-distribution detection or on input-dependent label uncertainties for in-distribution calibration, both of these types of uncertainty are often necessary. In this work, we propose the HetSNGP method for jointly modeling the model and data uncertainty. We show that our proposed model affords a favorable combination between these two types of uncertainty and thus outperforms the baseline methods on some challenging out-of-distribution datasets, including CIFAR-100C, ImageNet-C, and ImageNet-A. Moreover, we propose HetSNGP Ensemble, an ensembled version of our method which additionally models uncertainty over the network parameters and outperforms other ensemble baselines.

URL: https://openreview.net/forum?id=Id7hTt78FV

---

Title: Ranking Recovery under Privacy Considerations

Authors: Minoh Jeong, Alex Dytso, Martina Cardone

Abstract: We consider the private ranking recovery problem, where a data collector seeks to estimate the permutation/ranking of a data vector given a randomized (privatized) version of it. We aim to establish fundamental trade-offs between the performance of the estimation task, measured in terms of probability of error, and the level of privacy that can be guaranteed when the noise mechanism consists of adding artificial noise. Towards this end, we show the optimality of a low-complexity decision rule (referred to as linear decoder) for the estimation task, under several noise distributions widely used in the privacy literature (e.g., Gaussian, Laplace, and generalized normal model). We derive the Taylor series of the probability of error, which yields its first and second-order approximations when such a linear decoder is employed. We quantify the guaranteed level of privacy using differential privacy (DP) types of metrics, such as $\epsilon$-DP and $(\alpha,\epsilon)$-Rényi DP. Finally, we put together the results to characterize trade-offs between privacy and probability of error.
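
For illustration only (a toy sketch of the problem setup, not the linear decoder analyzed in the paper; the noise scale b and the data vector are arbitrary): privatize a data vector with additive Laplace noise and estimate its ranking from the noisy values.

import numpy as np

rng = np.random.default_rng(0)
x = np.array([3.2, 1.5, 4.8, 2.1])          # data vector held by the curator
b = 1.0                                      # Laplace scale; larger b means stronger privacy and higher error probability
y = x + rng.laplace(scale=b, size=x.shape)   # privatized (noisy) release
est_rank = np.argsort(np.argsort(-y))        # estimated rank of each entry (0 = largest)
true_rank = np.argsort(np.argsort(-x))
print(est_rank, true_rank)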

URL: https://openreview.net/forum?id=2EOVIvRXlv

---

Title: Learning the Transformer Kernel

Authors: Sankalan Pal Chowdhury, Adamos Solomou, Kumar Avinava Dubey, Mrinmaya Sachan

Abstract: In this work we introduce KL-TRANSFORMER, a generic, scalable, data-driven framework for learning the kernel function in Transformers. Our framework approximates the Transformer kernel as a dot product between spectral feature maps and learns the kernel by learning the spectral distribution. This not only helps in learning a generic kernel end-to-end, but also reduces the time and space complexity of Transformers from quadratic to linear. We show that KL-TRANSFORMERs achieve performance comparable to existing efficient Transformer architectures, both in terms of accuracy and computational efficiency. Our study also demonstrates that the choice of the kernel has a substantial impact on performance, and that kernel-learning variants are competitive alternatives to fixed-kernel Transformers on both long and short sequence tasks.
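
For context, a minimal sketch of linearized attention with random spectral features, in the spirit of the kernelized view described above (the ReLU random-feature map and all dimensions are illustrative assumptions, not the paper's learned spectral distribution):

import numpy as np

def feature_map(x, W):
    # illustrative random-feature map phi(x); the paper instead learns the spectral distribution
    return np.maximum(x @ W, 0.0) / np.sqrt(W.shape[1])

rng = np.random.default_rng(0)
n, d, m = 1024, 64, 128                   # sequence length, head dimension, number of features
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
W = rng.standard_normal((d, m))           # random projection (a placeholder for a learned one)

phiQ, phiK = feature_map(Q, W), feature_map(K, W)
KV = phiK.T @ V                           # (m, d): cost O(n m d) instead of O(n^2 d)
normalizer = phiQ @ phiK.sum(axis=0)      # (n,)
out = (phiQ @ KV) / normalizer[:, None]   # attention output, linear in sequence length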

URL: https://openreview.net/forum?id=tLIBAEYjcv

---

Title: Optimizing Functionals on the Space of Probabilities with Input Convex Neural Networks

Authors: David Alvarez-Melis, Yair Schiff, Youssef Mroueh

Abstract: Gradient flows are a powerful tool for optimizing functionals in general metric spaces, including the space of probabilities endowed with the Wasserstein metric. A typical approach to solving this optimization problem relies on its connection to the dynamic formulation of optimal transport and the celebrated Jordan-Kinderlehrer-Otto (JKO) scheme. However, this formulation involves optimization over convex functions, which is challenging, especially in high dimensions. In this work, we propose an approach that relies on the recently introduced input-convex neural networks (ICNN) to parametrize the space of convex functions in order to approximate the JKO scheme, as well as in designing functionals over measures that enjoy convergence guarantees. We derive a computationally efficient implementation of this JKO-ICNN framework and experimentally demonstrate its feasibility and validity in approximating solutions of low-dimensional partial differential equations with known solutions. We also demonstrate its viability in high-dimensional applications through an experiment in controlled generation for molecular discovery.
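
For reference, the JKO scheme mentioned above performs, for a functional $F$ over probability measures and a step size $\tau>0$, the proximal-type iteration $\rho_{k+1} \in \arg\min_{\rho} \, F(\rho) + \frac{1}{2\tau} W_2^2(\rho, \rho_k)$, i.e., a proximal step in the Wasserstein metric; the work above approximates this step by parameterizing convex potentials with ICNNs (this is the standard textbook form, not a formula copied from the paper).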

URL: https://openreview.net/forum?id=dpOYN7o8Jm

---

Title: Deformation Robust Roto-Scale-Translation Equivariant CNNs

Authors: Liyao Gao, Guang Lin, Wei Zhu

Abstract: Incorporating group symmetry directly into the learning process has proved to be an effective guideline for model design. By producing features that are guaranteed to transform covariantly to the group actions on the inputs, group-equivariant convolutional neural networks (G-CNNs) achieve significantly improved generalization performance in learning tasks with intrinsic symmetry. General theory and practical implementation of G-CNNs have been studied for planar images under either rotation or scaling transformation, but only individually. We present, in this paper, a roto-scale-translation equivariant CNN ($\mathcal{RST}$-CNN) that is guaranteed to achieve equivariance jointly over these three groups via coupled group convolutions. Moreover, as symmetry transformations in reality are rarely perfect and typically subject to input deformation, we provide a stability analysis of the equivariance of representation to input distortion, which motivates the truncated expansion of the convolutional filters under (pre-fixed) low-frequency spatial modes. The resulting model provably achieves deformation-robust $\mathcal{RST}$ equivariance, i.e., the $\mathcal{RST}$ symmetry is still "approximately" preserved when the transformation is "contaminated" by a nuisance data deformation, a property that is especially important for out-of-distribution generalization. Numerical experiments on MNIST, Fashion-MNIST, and STL-10 demonstrate that the proposed model yields remarkable gains over prior art, especially in the small data regime where both rotation and scaling variations are present within the data.
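
For context, in the standard formulation, equivariance of a feature map $\Phi$ with respect to a group $G$ means $\Phi(T_g x) = T'_g \Phi(x)$ for every $g \in G$, where $T_g$ is the action of $g$ on the input and $T'_g$ the corresponding action on the feature space; in the paper above, $g$ ranges jointly over rotations, scalings, and translations (a generic statement for orientation, not the paper's precise definition).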

URL: https://openreview.net/forum?id=yVkpxs77cD

---

Title: On the link between conscious function and general intelligence in humans and machines

Authors: Arthur Juliani, Kai Arulkumaran, Shuntaro Sasai, Ryota Kanai

Abstract: In popular media, there is often a connection drawn between the advent of awareness in artificial agents and those same agents simultaneously achieving human or superhuman level intelligence. In this work, we explore the validity and potential application of this seemingly intuitive link between consciousness and intelligence. We do so by examining the cognitive abilities associated with three contemporary theories of conscious function: Global Workspace Theory (GWT), Information Generation Theory (IGT), and Attention Schema Theory (AST). We find that all three theories specifically relate conscious function to some aspect of domain-general intelligence in humans. With this insight, we turn to the field of Artificial Intelligence (AI) and find that, while still far from demonstrating general intelligence, many state-of-the-art deep learning methods have begun to incorporate key aspects of each of the three functional theories. Having identified this trend, we use the motivating example of mental time travel in humans to propose ways in which insights from each of the three theories may be combined into a single unified and implementable model. Given that it is made possible by cognitive abilities underlying each of the three functional theories, artificial agents capable of mental time travel would not only possess greater general intelligence than current approaches, but also be more consistent with our current understanding of the functional role of consciousness in humans, thus making it a promising near-term goal for AI research.

URL: https://openreview.net/forum?id=LTyqvLEv5b

---

Title: Non-Deterministic Behavior of Thompson Sampling with Linear Payoffs and How to Avoid It

Authors: Doruk Kilitcioglu, Serdar Kadioglu

Abstract: Thompson Sampling with Linear Payoffs (LinTS) is a popular contextual bandit algorithm for solving sequential decision-making problems. While LinTS has been studied extensively in the academic literature, surprisingly, its reproducibility behavior has not received the same attention. In this paper, we show that a standard and seemingly correct LinTS implementation leads to non-deterministic behavior. This can easily go unnoticed, yet adversely impact results. It calls into question the reproducibility of papers that use LinTS, and it precludes using this particular implementation in any industrial application where reproducibility is critical, not only for debugging purposes but also for the trustworthiness of machine learning models. We first study the root cause of the non-deterministic behavior. We then conduct experiments on recommendation system benchmarks to demonstrate the impact of non-deterministic behavior in terms of reproducibility and downstream metrics. Finally, as a remedy, we show how to avoid the issue to ensure reproducible results and share general advice for practitioners.
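
For orientation, a minimal sketch of the standard LinTS decision step (this is not the implementation studied in the paper, and the root cause it identifies is not reproduced here; the snippet only shows where an explicitly seeded generator enters the posterior sampling):

import numpy as np

def lints_choose(contexts, B_inv, mu_hat, v, rng):
    # standard LinTS recipe: draw theta from the Gaussian posterior, then pick the best arm
    theta = rng.multivariate_normal(mu_hat, v**2 * B_inv)
    return int(np.argmax(contexts @ theta))

rng = np.random.default_rng(42)             # an explicit, seeded generator
contexts = rng.standard_normal((5, 3))      # 5 arms with 3-dimensional contexts
arm = lints_choose(contexts, B_inv=np.eye(3), mu_hat=np.zeros(3), v=1.0, rng=rng)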


URL: https://openreview.net/forum?id=sX9d3gfwtE

---

Title: Structural Learning in Artificial Neural Networks: A Neural Operator Perspective

Authors: Kaitlin Maile, Luga Hervé, Dennis George Wilson

Abstract: Over the history of Artificial Neural Networks (ANNs), only a minority of algorithms integrate structural changes of the network architecture into the learning process. Modern neuroscience has demonstrated that biological learning is largely structural, with mechanisms such as synaptogenesis and neurogenesis present in adult brains and considered important for learning. Despite this history of artificial methods and biological inspiration, and furthermore the recent resurgence of neural methods in deep learning, relatively few current ANN methods include structural changes in learning compared to those that only adjust synaptic weights during the training process. We aim to draw connections between different approaches of structural learning that have similar abstractions in order to encourage collaboration and development. In this review, we provide a survey on structural learning methods in deep ANNs, including a new neural operator framework from a cellular neuroscience context and perspective aimed at motivating research on this challenging topic. We then provide an overview of ANN methods which include structural changes within the neural operator framework in the learning process, characterizing each neural operator in detail and drawing connections to their biological counterparts. Finally, we present overarching trends in how these operators are implemented and discuss the open challenges in structural learning in ANNs.

URL: https://openreview.net/forum?id=gzhEGhcsnN

---


New submissions
===============


Title: A Simple Convergence Proof of Adam and Adagrad

Abstract: We provide a simple proof of convergence covering both the Adam and Adagrad adaptive optimization algorithms when applied to smooth (possibly non-convex) objective functions with bounded gradients. We show that in expectation, the squared norm of the objective gradient averaged over the trajectory has an upper bound which is explicit in the constants of the problem, parameters of the optimizer and the total number of iterations $N$. This bound can be made arbitrarily small: Adam with a learning rate $\alpha=1/\sqrt{N}$ and a momentum parameter on squared gradients $\beta_2=1-1/N$ achieves the same rate of convergence $O(\ln(N)/\sqrt{N})$ as Adagrad. Finally, we obtain the tightest dependency on the heavy ball momentum among all previous convergence bounds for non-convex Adam and Adagrad, improving from $O((1-\beta_1)^{-3})$ to $O((1-\beta_1)^{-1})$. Our technique also improves the best known dependency for standard SGD by a factor $1 - \beta_1$.
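
Schematically, the bound described above controls the average squared gradient norm along the trajectory: with $\alpha = 1/\sqrt{N}$ and $\beta_2 = 1 - 1/N$, $\mathbb{E}\big[\frac{1}{N}\sum_{n=1}^{N} \|\nabla f(x_n)\|^2\big] = O(\ln(N)/\sqrt{N})$, matching the Adagrad rate (a schematic restatement of the abstract, not the paper's exact theorem statement).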

URL: https://openreview.net/forum?id=ZPQhzTSWA7

---

Title: From Optimization Dynamics to Generalization Bounds via Łojasiewicz Gradient Inequality

Abstract: Optimization and generalization are two essential aspects of statistical machine learning. In this paper, we propose a framework to connect optimization with generalization by analyzing the generalization error based on the optimization trajectory under the gradient flow algorithm. The key ingredient of this framework is the Uniform-LGI, a property that is generally satisfied when training machine learning models. Leveraging the Uniform-LGI, we first derive convergence rates for the gradient flow algorithm, and then we give generalization bounds for a large class of machine learning models. We further apply our framework to three distinct machine learning models: linear regression, kernel regression, and two-layer neural networks. Through our approach, we obtain generalization estimates that match or extend previous results.
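
For context, a Łojasiewicz gradient inequality in one common form states that $\|\nabla f(x)\| \ge c\,(f(x) - f^\ast)^{\theta}$ for some constant $c>0$ and exponent $\theta$ (the Polyak-Łojasiewicz condition corresponds to $\theta = 1/2$); the Uniform-LGI used in the paper is a related condition required to hold along the training trajectory, and its precise definition is not reproduced here.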

URL: https://openreview.net/forum?id=mW6nD3567x

---

Title: LIMIS: Locally Interpretable Modeling using Instance-wise Subsampling

Abstract: Understanding black-box machine learning models is crucial for their widespread adoption. Learning globally interpretable models is one approach, but achieving high performance with them is challenging. An alternative approach is to explain individual predictions using locally interpretable models. For locally interpretable modeling, various methods have been proposed and indeed commonly used, but they suffer from low fidelity, i.e. their explanations do not approximate the predictions well. In this paper, our goal is to push the state-of-the-art in high-fidelity locally interpretable modeling. We propose a novel framework, Locally Interpretable Modeling using Instance-wise Subsampling (LIMIS). LIMIS utilizes a policy gradient to select a small number of instances and distills the black-box model into a low-capacity locally interpretable model using those selected instances. Training is guided with a reward obtained directly by measuring the fidelity of the locally interpretable models. We show on multiple tabular datasets that LIMIS near-matches the prediction accuracy of black-box models, significantly outperforming state-of-the-art locally interpretable models in terms of fidelity and prediction accuracy.

URL: https://openreview.net/forum?id=S8eABAy8P3

---

Title: Efficient Gradient Flows in Sliced-Wasserstein Space

Abstract: Minimizing functionals in the space of probability distributions can be done with Wasserstein gradient flows. To solve them numerically, a possible approach is to rely on the Jordan–Kinderlehrer–Otto (JKO) scheme which is analogous to the proximal scheme in Euclidean spaces. However, it requires solving a nested optimization problem at each iteration, and is known for its computational challenges, especially in high dimension. To alleviate it, very recent works propose to approximate the JKO scheme leveraging Brenier's theorem, and using gradients of Input Convex Neural Networks to parameterize the density (JKO-ICNN). However, this method comes with a high computational cost and stability issues. Instead, this work proposes to use gradient flows in the space of probability measures endowed with the sliced-Wasserstein (SW) distance. We argue that this method is more flexible than JKO-ICNN, since SW enjoys a closed-form differentiable approximation. Thus, the density at each step can be parameterized by any generative model which alleviates the computational burden and makes it tractable in higher dimensions.
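
For context, a minimal Monte Carlo estimator of the squared sliced-Wasserstein distance between two equally sized empirical samples (illustrative; the number of projections and the NumPy implementation are arbitrary choices, not the paper's code): project both samples onto random directions and compare the sorted one-dimensional projections.

import numpy as np

def sliced_wasserstein2(X, Y, n_proj=200, rng=None):
    # Monte Carlo estimate of SW_2^2 between empirical measures with the same sample size
    rng = rng if rng is not None else np.random.default_rng(0)
    thetas = rng.standard_normal((n_proj, X.shape[1]))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)   # uniform directions on the sphere
    px, py = X @ thetas.T, Y @ thetas.T                       # 1-D projections, shape (n, n_proj)
    # in 1-D, W_2^2 between empirical measures is the mean squared difference of sorted samples
    return np.mean((np.sort(px, axis=0) - np.sort(py, axis=0)) ** 2)

rng = np.random.default_rng(1)
X, Y = rng.standard_normal((500, 2)), rng.standard_normal((500, 2)) + 1.0
print(sliced_wasserstein2(X, Y, rng=rng))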

URL: https://openreview.net/forum?id=Au1LNKmRvh

---

Title: Bounding generalization error with input compression: An empirical study with infinite-width networks

Abstract: Estimating the Generalization Error (GE) of Deep Neural Networks (DNNs) is an important task that often relies on the availability of held-out data. The ability to better predict GE from a single training set may yield overarching DNN design principles that reduce reliance on trial-and-error, along with other performance-assessment advantages.
In search of a quantity relevant to GE, we investigate the Mutual Information (MI) between the input and final layer representations, using the infinite-width DNN limit to bound MI. An existing input compression-based GE bound is used to link MI and GE. To the best of our knowledge, this represents the first empirical study of this bound. In our attempt to empirically falsify the theoretical bound, we find that it is often tight for best-performing models. Furthermore, it detects randomization of training labels in many cases, reflects test-time perturbation robustness, and works well given only few training samples. These results are promising given that input compression is broadly applicable where MI can be estimated with confidence.

URL: https://openreview.net/forum?id=jbZEUtULft

---

Title: Explicit Group Sparse Projection with Applications to Deep Learning and NMF

Abstract: We design a new sparse projection method for a set of vectors that guarantees a desired average sparsity level, measured using the popular Hoyer measure (an affine function of the ratio of the $\ell_1$ and $\ell_2$ norms).
Existing approaches either project each vector individually or require the use of a regularization parameter which implicitly maps to the average $\ell_0$-measure of sparsity. Instead, in our approach we set the sparsity level for the whole set explicitly and simultaneously project a group of vectors with the sparsity level of each vector tuned automatically.
We show that the computational complexity of our projection operator is linear in the size of the problem.
Additionally, we propose a generalization of this projection by replacing the $\ell_1$ norm by its weighted version.
We showcase the efficacy of our approach in both supervised and unsupervised learning tasks on image datasets including CIFAR10 and ImageNet. In deep neural network pruning, the sparse models produced by our method on ResNet50 have significantly higher accuracies at corresponding sparsity values compared to existing competitors. In nonnegative matrix factorization, our approach yields competitive reconstruction errors against state-of-the-art algorithms.
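
For reference, a small sketch of the Hoyer sparsity measure referred to above, in its usual normalization to [0, 1] (this is the measure itself, not the paper's projection operator):

import numpy as np

def hoyer_sparsity(x):
    # affine in ||x||_1 / ||x||_2: equals 1 for a 1-sparse vector and 0 for a constant vector
    n = x.size
    ratio = np.linalg.norm(x, 1) / np.linalg.norm(x, 2)
    return (np.sqrt(n) - ratio) / (np.sqrt(n) - 1)

print(hoyer_sparsity(np.array([0.0, 0.0, 3.0])))   # 1.0 (maximally sparse)
print(hoyer_sparsity(np.ones(3)))                  # 0.0 (fully dense)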

URL: https://openreview.net/forum?id=jIrOeWjdpc

---

Title: Unimodal Likelihood Models for Ordinal Data

Abstract: Ordinal regression (OR) is the classification of ordinal data, in which the underlying target variable is categorical and considered to have a natural ordinal relation to the explanatory variable. In this study, we suppose the unimodality of the conditional probability distributions as a natural ordinal relation of the ordinal data. Under this supposition, unimodal likelihood models are expected to be promising for improving generalization performance in OR tasks. After demonstrating that previous unimodal likelihood models have weak representation ability, we develop more representable unimodal models, including the most representable one. Our OR experiments show that the developed, more representable unimodal models yield better generalization performance on real-world ordinal data than previous unimodal models and popular statistical OR models without a unimodality guarantee.

URL: https://openreview.net/forum?id=1l0sClLiPc

---

Title: Bridging Offline and Online Experimentation: Constraint Active Search for Deployed Performance Optimization

Abstract: A common challenge in machine learning model development is that models perform differently between the offline development phase and the eventual deployment phase. Fundamentally, the goal of such a model is to maximize performance during deployment, but such performance cannot be measured offline. As such, we propose to augment standard offline sample-efficient hyperparameter optimization to instead search offline for a diverse set of models which can have potentially superior online performance. To this end, we utilize Constraint Active Search to identify such a diverse set of models, and we study their online performance using a variant of Best Arm Identification to select the best model for deployment. The key contribution of this article is the theoretical analysis of this development phase, both in analyzing the probability of improvement over the baseline as well as the number of viable treatments for online testing. We demonstrate the viability of this strategy on synthetic examples, as well as a recommendation system benchmark.

URL: https://openreview.net/forum?id=XX8CEN815d

---

Title: Modified Threshold Method for Ordinal Regression

Abstract: Ordinal regression (OR, also called ordinal classification) is the classification of ordinal data in which the underlying target variable is discrete and has a natural ordinal relation. For OR problems, threshold methods are often employed since they are considered to capture the ordinal relation of the data well: they learn a one-dimensional transformation (1DT) of the explanatory variable and classify the data by assigning each learned 1DT value the rank of the interval into which it falls, among as many ordered intervals as there are classes. In existing methods, the threshold parameters separating the intervals are determined regardless of the learning result of the 1DT and the task under consideration, which lacks theoretical justification. Such conventional settings may degrade classification performance. We therefore propose a novel, computationally efficient method for determining the threshold parameters: it learns each threshold parameter independently by solving a relaxation of the empirical task-risk minimization for the learned 1DT. In experiments, the proposed labeling procedure achieved superior classification performance with a modest additional computational load compared with four related existing labeling procedures.
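
A toy sketch of the labeling rule used by threshold methods as described above (variable names and values are illustrative; how the thresholds themselves are determined is exactly what the paper modifies):

import numpy as np

def threshold_label(u, thresholds):
    # map a 1DT value u to a class in {1, ..., K} given K-1 sorted thresholds
    return 1 + int(np.sum(u >= np.asarray(thresholds)))

thresholds = [-1.0, 0.5, 2.0]   # K - 1 = 3 thresholds, so K = 4 classes
print([threshold_label(u, thresholds) for u in (-2.0, 0.0, 1.0, 3.0)])   # [1, 2, 3, 4]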

URL: https://openreview.net/forum?id=PInXz6Gasv

---

Title: Faking Interpolation Until You Make It

Abstract: Deep over-parameterized neural networks exhibit the interpolation property on many data sets: such models can achieve approximately zero loss on all training samples simultaneously. This property has been exploited to develop optimisation algorithms for this setting, which use the known optimal loss value to employ a variant of the Polyak step size calculated on each stochastic batch of data. We introduce a novel extension of this idea to tasks where the interpolation property does not hold. As we no longer have access to the optimal loss values a priori, we instead estimate them for each sample online. To realise this, we introduce a simple but highly effective heuristic for approximating the optimal value based on previous loss evaluations. We provide rigorous experimentation on a range of problems and demonstrate the effectiveness of our approach, which outperforms other single-hyperparameter optimisation methods.
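
For context, the classical stochastic Polyak step size that this line of work builds on uses the known optimal loss value $f_i^\ast$ of the sampled batch: $\gamma_k = \frac{f_i(x_k) - f_i^\ast}{\|\nabla f_i(x_k)\|^2}$, which reduces to $f_i(x_k)/\|\nabla f_i(x_k)\|^2$ under interpolation ($f_i^\ast = 0$); the method above replaces $f_i^\ast$ with an online estimate when interpolation fails (a schematic statement of the background step size, not the paper's exact update).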

URL: https://openreview.net/forum?id=OslAMMF4ZP

---

Title: MixTailor: Mixed Gradient Aggregation for Robust Learning Against Tailored Attacks

Abstract: Implementations of SGD on distributed and multi-GPU systems create new vulnerabilities, which can be identified and misused by one or more adversarial agents. Recently, it has been shown that well-known Byzantine-resilient gradient aggregation schemes are indeed vulnerable to informed attackers that can tailor the attacks (Fang et al., 2020; Xie et al., 2020b). We introduce MixTailor, a scheme based on randomization of the aggregation strategies that makes it impossible for the attacker to be fully informed. Deterministic schemes can be integrated into MixTailor on the fly without introducing any additional hyperparameters. Randomization decreases the capability of a powerful adversary to tailor its attacks, while the resulting randomized aggregation scheme remains competitive in terms of performance. For both iid and non-iid settings, we establish almost sure convergence guarantees that are both stronger and more general than those available in the literature. Our empirical studies across various datasets, attacks, and settings validate our hypothesis and show that MixTailor successfully defends when well-known Byzantine-tolerant schemes fail.
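
A toy sketch of the randomization idea described above (the pool of aggregators and the uniform sampling rule are illustrative placeholders, not the construction analyzed in the paper): at each step the server draws one aggregation rule from a pool, so an attacker cannot tailor its gradients to a single known rule.

import numpy as np

def trimmed_mean(grads, trim=1):
    g = np.sort(grads, axis=0)                 # coordinate-wise sort across workers
    return g[trim:-trim].mean(axis=0)

AGGREGATORS = [
    lambda g: g.mean(axis=0),                  # plain averaging
    lambda g: np.median(g, axis=0),            # coordinate-wise median
    trimmed_mean,                              # coordinate-wise trimmed mean
]

def randomized_aggregate(worker_grads, rng):
    agg = AGGREGATORS[rng.integers(len(AGGREGATORS))]   # rule unknown to the attacker
    return agg(np.asarray(worker_grads))

rng = np.random.default_rng(0)
grads = [np.array([0.1, 0.2]), np.array([0.0, 0.3]), np.array([5.0, -5.0]), np.array([0.2, 0.1])]
print(randomized_aggregate(grads, rng))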

URL: https://openreview.net/forum?id=tqDhrbKJLS

---

Title: A Fast and Convergent Proximal Algorithm for Regularized Nonconvex and Nonsmooth Bi-level Optimization

Abstract: Many important machine learning applications involve regularized nonconvex bi-level optimization. However, existing gradient-based bi-level optimization algorithms cannot handle nonconvex or nonsmooth regularizers, and they suffer from high computation complexity in nonconvex bi-level optimization. In this work, we study a proximal gradient-type algorithm that adopts the approximate implicit differentiation (AID) scheme for nonconvex bi-level optimization with possibly nonconvex and nonsmooth regularizers. In particular, the algorithm applies Nesterov's momentum to accelerate the computation of the implicit gradient involved in AID. We provide a comprehensive analysis of the global convergence properties of this algorithm by identifying its intrinsic potential function. In particular, we formally establish the convergence of the model parameters to a critical point of the bi-level problem and obtain an improved computation complexity of $\mathcal{O}(\kappa^{3.5}\epsilon^{-2})$ over the state-of-the-art result. Moreover, we analyze the asymptotic convergence rates of this algorithm under a class of local nonconvex geometries characterized by a {\L}ojasiewicz-type gradient inequality. Experiments on hyper-parameter optimization demonstrate the effectiveness of our algorithm.
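
For reference, the regularized bi-level problem described above has the generic form $\min_{x} \, f\big(x, y^\ast(x)\big) + h(x)$ subject to $y^\ast(x) \in \arg\min_{y} g(x, y)$, where $h$ is the possibly nonconvex and nonsmooth regularizer handled by the proximal step (a generic formulation given for context, not copied from the paper).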

URL: https://openreview.net/forum?id=8xYjvaCxNR

---

Title: An Empirical Comparison of Off-policy Prediction Learning Algorithms on the Collision Task

Abstract: Off-policy prediction, learning the value function for one policy from data generated while following another policy, is one of the most challenging subproblems in reinforcement learning. This paper presents empirical results with eleven prominent off-policy learning algorithms that use linear function approximation: five Gradient-TD methods, two Emphatic-TD methods, Off-policy TD, Vtrace, and variants of Tree Backup and ABQ that are derived in this paper such that they are applicable to the prediction setting. Our experiments used the Collision task, a small off-policy problem analogous to that of an autonomous car trying to predict whether it will collide with an obstacle. We assessed the performance of the algorithms according to their learning rate, asymptotic error level, and sensitivity to step-size and bootstrapping parameters. By these measures, the eleven algorithms can be partially ordered on the Collision task. In the top tier, the two Emphatic-TD algorithms learned the fastest, reached the lowest errors, and were robust to parameter settings. In the middle tier, the five Gradient-TD algorithms and Off-policy TD were more sensitive to the bootstrapping parameter. The bottom tier comprised Vtrace, Tree Backup, and ABQ; these algorithms were no faster and had higher asymptotic error than the others. Our results are definitive for this task, though of course experiments with more tasks are needed before an overall assessment of the algorithms' merits can be made.

URL: https://openreview.net/forum?id=4w3Pya9OxC

---

Title: Sequential Deconfounding for Causal Inference with Unobserved Confounders

Abstract: Observational data is often used to estimate the effect of a treatment when randomized experiments are infeasible or costly. However, observational data often yields biased estimates of treatment effects, since treatment assignment can be confounded by unobserved variables. A remedy is offered by deconfounding methods that adjust for such unobserved confounders. In this paper, we develop the Sequential Deconfounder, a method that enables estimating individualized treatment effects over time in the presence of unobserved confounders. This is the first deconfounding method that can be used with a single treatment assigned at each timestep. The Sequential Deconfounder uses a novel Gaussian process latent variable model to infer substitutes for the unobserved confounders, which are then used in conjunction with an outcome model to estimate treatment effects over time. We prove that using our method yields unbiased estimates of individualized treatment responses over time. Using simulated and real medical data, we demonstrate the efficacy of our method in deconfounding the estimation of treatment responses over time.

URL: https://openreview.net/forum?id=DZPydgaUY5

---

Title: Competition over data: how does data purchase affect users?

Abstract: As the competition among machine learning (ML) predictors is widespread in practice, it becomes increasingly important to understand the impact and biases arising from such competition. One critical aspect of ML competition is that ML predictors are constantly updated by acquiring additional data during the competition. Although this active data acquisition can largely affect the overall competition environment, it has not been well-studied before. In this paper, we study what happens when ML predictors can purchase additional data during the competition. We introduce a new environment in which ML predictors use active learning algorithms to effectively acquire labeled data within their budgets while competing against each other. We empirically show that the overall performance of an ML predictor improves when predictors can purchase additional labeled data. Surprisingly, however, the quality that users experience---i.e., the accuracy of the predictor selected by each user---can decrease even as the individual predictors get better. We demonstrate that this phenomenon naturally arises due to a trade-off whereby competition pushes each predictor to specialize in a subset of the population while data purchase has the effect of making predictors more uniform. With comprehensive experiments, we show that our findings are robust against different modeling assumptions.

URL: https://openreview.net/forum?id=63sJsCmq6Q

---

Title: Target Propagation via Regularized Inversion

Abstract: Target Propagation (TP) algorithms compute targets instead of gradients along neural networks and propagate them backward in a way that is similar to, yet different from, gradient back-propagation (BP). The idea was first presented as a perturbative alternative to BP that may improve gradient evaluation accuracy when training multi-layer neural networks (LeCun, 1985) and has gained popularity as a biologically plausible counterpart of BP. However, TP may have remained more of a template algorithm with many variations than a well-identified algorithm. Revisiting the insights of LeCun (1985) and Lee et al. (2015), we present a simple version of TP based on regularized inversions of network layers that sheds light on the relevance of TP from an optimization viewpoint and is easily implementable in a differentiable programming framework. We show how TP can be used to train recurrent neural networks with long sequences on various sequence modeling problems and delineate theoretically and empirically the regimes in which the computational complexity of TP can be attractive compared to BP.

URL: https://openreview.net/forum?id=vxyjTUPV24

---

Title: Distribution Embedding Networks for Generalization from a Diverse Set of Classification Tasks

Abstract: We propose Distribution Embedding Networks (DEN) for classification with small data. In the same spirit of meta-learning, DEN learns from a diverse set of training tasks with the goal to generalize to unseen target tasks. Unlike existing approaches which require the inputs of training and target tasks to have the same dimensionality with possibly similar distributions, DEN allows training and target tasks to live in heterogeneous input spaces. This is especially useful for tabular-data tasks where labeled data from related tasks are scarce. DEN uses a three-block architecture: a covariate transformation block followed by a distribution embedding block and then a classification block. We provide theoretical insights to show that this architecture allows the embedding and classification blocks to be fixed after pre-training on a diverse set of tasks; only the covariate transformation block with relatively few parameters needs to be fine-tuned for each new task. To facilitate training, we also propose an approach to synthesize binary classification tasks, and demonstrate that DEN outperforms existing methods in a number of synthetic and real tasks in numerical studies.

URL: https://openreview.net/forum?id=F2rG2CXsgO

---

Title: Calibrate and Debias Layer-wise Sampling for Graph Convolutional Networks

Abstract: Multiple sampling-based methods have been developed for approximating and accelerating node embedding aggregation in graph convolutional networks (GCNs) training. Among them, a layer-wise approach recursively performs importance sampling to select neighbors jointly for existing nodes in each layer. This paper revisits the approach from a matrix approximation perspective and identifies two issues in the existing layer-wise sampling methods: sub-optimal sampling probabilities and estimation biases induced by sampling without replacement. To address these issues, we accordingly propose two remedies: a new principle for constructing sampling probabilities and an efficient debiasing algorithm. The improvements are demonstrated by extensive analyses of estimation variance and experiments on common benchmarks.

URL: https://openreview.net/forum?id=JyKNuoZGux

---

Title: DHA: End-to-End Joint Optimization of Data Augmentation Policy, Hyper-parameter and Architecture

Abstract: Automated machine learning (AutoML) usually involves several crucial components, such as Data Augmentation (DA) policy, Hyper-Parameter Optimization (HPO), and Neural Architecture Search (NAS).
Although many strategies have been developed for automating these components in isolation, joint optimization of these components remains challenging due to the largely increased search dimension and the varying input types of each component. In parallel to this, the common practice of \textit{searching} for the optimal architecture first and then \textit{retraining} it before deployment in NAS often suffers from low performance correlation between the searching and retraining stages. An end-to-end solution that integrates the AutoML components and returns a ready-to-use model at the end of the search is desirable.
In view of these, we propose \textbf{DHA}, which achieves joint optimization of \textbf{D}ata augmentation policy, \textbf{H}yper-parameter and \textbf{A}rchitecture. Specifically, end-to-end NAS is achieved in a differentiable manner by optimizing a compressed lower-dimensional feature space, while DA policy and HPO are updated dynamically at the same time. Experiments show that DHA achieves state-of-the-art (SOTA) results on various datasets and search spaces.
To the best of our knowledge, we are the first to efficiently and jointly optimize DA policy, NAS, and HPO in an end-to-end manner without retraining.

URL: https://openreview.net/forum?id=MHOAEiTlen

---

Title: Practicality of generalization guarantees for unsupervised domain adaptation with neural networks

Abstract: Understanding generalization is crucial to confidently engineer and deploy machine learning models, especially when deployment implies a shift in the data domain.
For such domain adaptation problems, we seek generalization bounds which are tractably computable and tight. If these desiderata can be reached, the bounds can serve as guarantees for adequate performance in deployment.
However, in applications where deep neural networks are the models of choice, deriving results which fulfill these desiderata remains an unresolved challenge; most existing bounds are either vacuous or have non-estimable terms, even in favorable conditions.
In this work, we evaluate existing bounds from the literature with potential to satisfy our desiderata on domain adaptation image classification tasks, where deep neural networks are preferred. We find that all bounds are vacuous and that sample generalization terms account for much of the observed looseness, especially when these terms interact with measures of domain shift. To overcome this and arrive at the tightest possible results, we combine each bound with recent data-dependent PAC-Bayes analysis, greatly improving the guarantees. We find that, when domain overlap can be assumed, a simple importance weighting extension of previous work provides the tightest estimable bound. Finally, we study which terms dominate the bounds and identify possible directions for further improvement.

URL: https://openreview.net/forum?id=vUuHPRrWs2

---
