Weekly TMLR digest for Nov 13, 2022

TMLR

Nov 12, 2022, 7:00:11 PM
to tmlr-annou...@googlegroups.com


New certifications
==================

Featured Certification: A Generalist Agent

Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gómez Colmenarejo, Alexander Novikov, Gabriel Barth-maron, Mai Giménez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, Tom Eccles, Jake Bruce, Ali Razavi, Ashley Edwards, Nicolas Heess, Yutian Chen, Raia Hadsell, Oriol Vinyals, Mahyar Bordbar, Nando de Freitas

https://openreview.net/forum?id=1ikK0kHjvj

---


Featured Certification: Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, Yonghui Wu

https://openreview.net/forum?id=AFDcYJKhND

---


Accepted papers
===============


Title: Approximate Policy Iteration with Bisimulation Metrics

Authors: Mete Kemertas, Allan Douglas Jepson

Abstract: Bisimulation metrics define a distance measure between states of a Markov decision process (MDP) based on a comparison of reward sequences. Due to this property they provide theoretical guarantees in value function approximation (VFA). In this work we first prove that bisimulation and $\pi$-bisimulation metrics can be defined via a more general class of Sinkhorn distances, which unifies various state similarity metrics used in recent work. Then we describe an approximate policy iteration (API) procedure that uses a bisimulation-based discretization of the state space for VFA and prove asymptotic performance bounds. Next, we bound the difference between $\pi$-bisimulation metrics in terms of the change in the policies themselves. Based on these results, we design an API($\alpha$) procedure that employs conservative policy updates and enjoys better performance bounds than the naive API approach. We discuss how such API procedures map onto practical actor-critic methods that use bisimulation metrics for state representation learning. Lastly, we validate our theoretical results and investigate their practical implications via a controlled empirical analysis based on an implementation of bisimulation-based API for finite MDPs.
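
For background (a standard definition for orientation, not necessarily the paper's exact notation): the $\pi$-bisimulation metric of Castro (2020) is the unique fixed point

    $d^\pi(s, s') = |r^\pi(s) - r^\pi(s')| + \gamma \, \mathcal{W}_1(d^\pi)\big(P^\pi(\cdot \mid s), P^\pi(\cdot \mid s')\big)$,

where $\mathcal{W}_1(d)$ denotes the 1-Wasserstein distance with ground metric $d$; the unification described above replaces $\mathcal{W}_1$ with the broader family of entropy-regularized Sinkhorn distances.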

URL: https://openreview.net/forum?id=Ii7UeHc0mO

---

Title: Exposing Outlier Exposure: What Can Be Learned From Few, One, and Zero Outlier Images

Authors: Philipp Liznerski, Lukas Ruff, Robert A. Vandermeulen, Billy Joe Franks, Klaus Robert Muller, Marius Kloft

Abstract: Due to the intractability of characterizing everything that looks unlike the normal data, anomaly detection (AD) is traditionally treated as an unsupervised problem utilizing only normal samples. However, it has recently been found that unsupervised image AD can be drastically improved through the utilization of huge corpora of random images to represent anomalousness, a technique known as Outlier Exposure. In this paper, we show that specialized AD learning methods seem unnecessary for state-of-the-art performance and, furthermore, that one can achieve strong performance with just a small collection of Outlier Exposure data, contradicting common assumptions in the field of AD. We find that standard classifiers and semi-supervised one-class methods trained to discern between normal samples and relatively few random natural images are able to outperform the current state of the art on an established AD benchmark with ImageNet. Further experiments reveal that even one well-chosen outlier sample is sufficient to achieve decent performance on this benchmark (79.3% AUC). We investigate this phenomenon and find that one-class methods are more robust to the choice of training outliers, indicating that there are scenarios where these are still more useful than standard classifiers. Additionally, we include experiments that delineate the scenarios where our results hold. Lastly, no training samples are necessary when one uses the representations learned by CLIP, a recent foundation model, which achieves state-of-the-art AD results on CIFAR-10 and ImageNet in a zero-shot setting.
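
As a concrete illustration of the setup (a minimal sketch, not the authors' code; `backbone` and `head` are assumed models):

    import torch
    import torch.nn as nn

    # Train a standard binary classifier to separate normal samples from a
    # handful of Outlier Exposure images; the sigmoid of the logit then
    # serves as the anomaly score at test time.
    def oe_loss(backbone, head, normal_x, outlier_x):
        x = torch.cat([normal_x, outlier_x], dim=0)
        y = torch.cat([torch.zeros(len(normal_x)),   # 0 = normal
                       torch.ones(len(outlier_x))]).to(x.device)
        logits = head(backbone(x)).squeeze(-1)
        return nn.functional.binary_cross_entropy_with_logits(logits, y)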

URL: https://openreview.net/forum?id=3v78awEzyB

---

Title: Mitigating Catastrophic Forgetting in Spiking Neural Networks through Threshold Modulation

Authors: Ilyass Hammouamri, Timothée Masquelier, Dennis George Wilson

Abstract: Artificial Neural Networks (ANNs) trained with Backpropagation and Stochastic Gradient Descent (SGD) suffer from the problem of Catastrophic Forgetting: when learning tasks sequentially, the ANN tends to abruptly forget previous knowledge upon being trained on a new task. Biological neural networks, on the other hand, do not suffer from this problem. Spiking Neural Networks (SNNs) are a class of neural networks that are closer to biological networks than ANNs, and their intrinsic properties inspired by biology could alleviate the problem of Catastrophic Forgetting. In this paper, we investigate whether the firing threshold mechanism of SNNs can be used to gate the activity of the network in order to reduce catastrophic forgetting. To this end, we evolve a Neuromodulatory Network that adapts the thresholds of an SNN depending on the spiking activity of the previous layer. Our experiments on different datasets show that the neuromodulated SNN can mitigate forgetting significantly with respect to a fixed-threshold SNN. We also show that the evolved Neuromodulatory Network can generalize to multiple new scenarios, and we analyze its behavior.
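
A minimal sketch of the mechanism (ours; the paper evolves the modulatory network rather than training it by gradient descent, and `ModulatedLIF` is a hypothetical module):

    import torch
    import torch.nn as nn

    class ModulatedLIF(nn.Module):
        """Leaky integrate-and-fire layer whose firing threshold is set
        per-neuron by a small network reading the previous layer's activity."""
        def __init__(self, n_in, n_out, beta=0.9, base_thr=1.0):
            super().__init__()
            self.fc = nn.Linear(n_in, n_out)
            self.mod = nn.Linear(n_in, n_out)  # neuromodulatory network
            self.beta, self.base_thr = beta, base_thr

        def forward(self, spikes_in, v):
            v = self.beta * v + self.fc(spikes_in)                    # membrane update
            thr = self.base_thr + torch.sigmoid(self.mod(spikes_in))  # modulated threshold
            spikes_out = (v >= thr).float()                           # emit spikes
            v = v - spikes_out * thr                                  # soft reset
            return spikes_out, v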

URL: https://openreview.net/forum?id=15SoThZmtU

---

Title: A Generalist Agent

Authors: Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gómez Colmenarejo, Alexander Novikov, Gabriel Barth-maron, Mai Giménez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, Tom Eccles, Jake Bruce, Ali Razavi, Ashley Edwards, Nicolas Heess, Yutian Chen, Raia Hadsell, Oriol Vinyals, Mahyar Bordbar, Nando de Freitas

Abstract: Inspired by progress in large-scale language modeling, we apply a similar approach towards building a single generalist agent beyond the realm of text outputs. The agent, which we refer to as Gato, works as a multi-modal, multi-task, multi-embodiment generalist policy. The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens. In this report we describe the model and the data, and document the current capabilities of Gato.

URL: https://openreview.net/forum?id=1ikK0kHjvj

---

Title: ZerO Initialization: Initializing Neural Networks with only Zeros and Ones

Authors: Jiawei Zhao, Florian Tobias Schaefer, Anima Anandkumar

Abstract: Deep neural networks are usually initialized with random weights, with the initial variance selected to ensure stable signal propagation during training. However, selecting the appropriate variance becomes challenging, especially as the number of layers grows. In this work, we replace random weight initialization with a fully deterministic initialization scheme, viz., ZerO, which initializes the weights of networks with only zeros and ones (up to a normalization factor), based on identity and Hadamard transforms. Through both theoretical and empirical studies, we demonstrate that ZerO is able to train networks without damaging their expressivity. Applying ZerO to ResNet achieves state-of-the-art performance on various datasets, including ImageNet, which suggests random weights may be unnecessary for network initialization. In addition, ZerO has many benefits, such as training ultra-deep networks (without batch normalization), exhibiting low-rank learning trajectories that result in low-rank and sparse solutions, and improving training reproducibility.
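
A loose illustration of deterministic identity/Hadamard-based initialization (see the paper for the exact ZerO algorithm and its normalization):

    import numpy as np
    from scipy.linalg import hadamard

    def deterministic_init(n_out, n_in):
        if n_out == n_in:
            return np.eye(n_in)                   # identity when dimensions match
        m = 2 ** int(np.ceil(np.log2(max(n_out, n_in))))
        H = hadamard(m) / np.sqrt(m)              # normalized Hadamard matrix
        return H[:n_out, :n_in]                   # rectangular slice for dim changes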

URL: https://openreview.net/forum?id=1AxQpKmiTc

---

Title: A Rigorous Study Of The Deep Taylor Decomposition

Authors: Leon Sixt, Tim Landgraf

Abstract: Saliency methods attempt to explain deep neural networks by highlighting the most salient features of a sample. Some widely used methods are based on a theoretical framework called Deep Taylor Decomposition (DTD), which formalizes the recursive application of the Taylor Theorem to the network's layers. However, recent work has found that these methods are independent of the network's deeper layers and appear to respond only to lower-level image structure. Here, we investigate DTD theory to better understand this perplexing behavior and find that the Deep Taylor Decomposition is equivalent to the basic gradient$\times$input method when the Taylor root points (an important parameter of the algorithm, chosen by the user) are locally constant. If the root points are locally input-dependent, then one can justify any explanation; in this case, the theory is under-constrained. In an empirical evaluation, we find that DTD roots do not lie in the same linear regions as the input -- contrary to a fundamental assumption of the Taylor Theorem. The theoretical foundations of DTD have been cited as a source of reliability for the explanations; however, our findings urge caution in making such claims.
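
For reference, the gradient$\times$input baseline that DTD collapses to when root points are locally constant (a minimal sketch):

    import torch

    def gradient_x_input(model, x, target_class):
        x = x.clone().requires_grad_(True)
        score = model(x)[:, target_class].sum()
        score.backward()
        return x.grad * x   # elementwise gradient times input = saliency map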

URL: https://openreview.net/forum?id=Y4mgmw9OgV

---

Title: Convergence of denoising diffusion models under the manifold hypothesis

Authors: Valentin De Bortoli

Abstract: Denoising diffusion models are a recent class of generative models exhibiting state-of-the-art performance in image and audio synthesis. Such models approximate the time-reversal of a forward noising process from a target distribution to a reference measure, which is usually Gaussian. Despite their strong empirical results, the theoretical analysis of such models remains limited. In particular, all current approaches crucially assume that the target distribution admits a density w.r.t. the Lebesgue measure. This does not cover settings where the target distribution is supported on a lower-dimensional manifold or is given by some empirical distribution. In this paper, we bridge this gap by providing the first convergence results for diffusion models in this setting. In particular, we provide quantitative bounds on the Wasserstein distance of order one between the target data distribution and the generative distribution of the diffusion model.
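
For orientation, the usual continuous-time formulation (our notation, not necessarily the paper's): the forward noising process is an Ornstein-Uhlenbeck SDE

    $\mathrm{d} X_t = -X_t \, \mathrm{d} t + \sqrt{2} \, \mathrm{d} B_t$,

which converges to the Gaussian reference measure, and generation runs the time-reversal

    $\mathrm{d} Y_t = \{ Y_t + 2 \nabla \log p_{T-t}(Y_t) \} \, \mathrm{d} t + \sqrt{2} \, \mathrm{d} B_t$,

with the score $\nabla \log p_t$ replaced by a learned approximation. The contribution here is bounding the order-one Wasserstein distance between the data distribution and the law of the generated samples without assuming a Lebesgue density.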

URL: https://openreview.net/forum?id=MhK5aXo3gB

---

Title: Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

Authors: Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, Yonghui Wu

Abstract: We present the Pathways Autoregressive Text-to-Image (Parti) model, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge. Parti treats text-to-image generation as a sequence-to-sequence modeling problem, akin to machine translation, with sequences of image tokens as the target outputs rather than text tokens in another language. This strategy can naturally tap into the rich body of prior work on large language models, which have seen continued advances in capabilities and performance through scaling data and model sizes. Our approach is simple: First, Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens. Second, we achieve consistent quality improvements by scaling the encoder-decoder Transformer model up to 20B parameters, with a new state-of-the-art zero-shot FID score of 7.23 and finetuned FID score of 3.22 on MS-COCO. Our detailed analysis of Localized Narratives as well as PartiPrompts (P2), a new holistic benchmark of over 1600 English prompts, demonstrates the effectiveness of Parti across a wide variety of categories and difficulty aspects. We also explore and highlight limitations of our models in order to define and exemplify key areas of focus for further improvements.
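
Schematically, the two-stage pipeline looks as follows (all names are hypothetical placeholders, not a released API):

    # Stage 1: a ViT-VQGAN tokenizer maps images to/from discrete tokens.
    # Stage 2: an encoder-decoder Transformer translates text tokens into
    # image tokens, exactly as in sequence-to-sequence machine translation.
    image_tokens = vit_vqgan.encode(image)
    pred_tokens = transformer.generate(text_tokens)
    image_out = vit_vqgan.decode(pred_tokens)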

URL: https://openreview.net/forum?id=AFDcYJKhND

---

Title: Fail-Safe Adversarial Generative Imitation Learning

Authors: Philipp Geiger, Christoph-Nikolas Straehle

Abstract: For flexible yet safe imitation learning (IL), we propose theory and a modular method, with a safety layer that enables a closed-form probability density/gradient of the safe generative continuous policy, end-to-end generative adversarial training, and worst-case safety guarantees. The safety layer maps all actions into a set of safe actions, and uses the change-of-variables formula plus additivity of measures for the density. The set of safe actions is inferred by first checking the safety of a finite sample of actions via adversarial reachability analysis of fallback maneuvers, and then inferring the safety of these actions' neighborhoods using, e.g., Lipschitz continuity. We provide theoretical analysis showing the robustness advantage of using the safety layer already during training (imitation error linear in the horizon) compared to only using it at test time (up to quadratic error). In an experiment on real-world driver interaction data, we empirically demonstrate the tractability, safety, and imitation performance of our approach.

URL: https://openreview.net/forum?id=e4Bb0b3QgJ

---

Title: A Note on "Assessing Generalization of SGD via Disagreement"

Authors: Andreas Kirsch, Yarin Gal

Abstract: Several recent works find empirically that the average test error of deep neural networks can be estimated via the prediction disagreement of models, which does not require labels. In particular, Jiang et al. (2022) show that, for the disagreement between two separately trained networks, this `Generalization Disagreement Equality' follows from the well-calibrated nature of deep ensembles under a proposed notion of `class-aggregated calibration.' In this reproduction, we show that the suggested theory might be impractical because a deep ensemble's calibration can deteriorate as prediction disagreement increases, which is precisely when the coupling of test error and disagreement is of interest, while labels are needed to estimate the calibration on new datasets. Further, we simplify the theoretical statements and proofs, showing them to be straightforward within a probabilistic context, unlike the original hypothesis-space view employed by Jiang et al. (2022).
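
The quantity under study is simple to state (a minimal sketch):

    import numpy as np

    # Disagreement rate of two independently trained networks on the same
    # unlabeled test set; the Generalization Disagreement Equality asserts
    # this approximately equals the (unknown) test error whenever the
    # ensemble satisfies class-aggregated calibration.
    def disagreement_rate(preds_a, preds_b):
        return np.mean(preds_a != preds_b)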

URL: https://openreview.net/forum?id=oRP8urZ8Fx

---


New submissions
===============


Title: Settling the Communication Complexity for Distributed Offline Reinforcement Learning

Abstract: We study a novel setting in offline reinforcement learning (RL) in which a number of distributed machines jointly cooperate to solve the problem, but only a single round of communication is allowed and there is a budget constraint on the total amount of information (in bits) that each machine can send out. For value function prediction in contextual bandits, and in both episodic and non-episodic MDPs, we establish information-theoretic lower bounds on the minimax risk for distributed statistical estimators; this reveals the minimum amount of communication required by any offline RL algorithm. Specifically, for contextual bandits, we show that the number of bits must scale at least as $\Omega(AC)$ to match the centralised minimax optimal rate, where $A$ is the number of actions and $C$ is the context dimension; we reach similar results in the MDP settings. Furthermore, we develop learning algorithms based on least-squares estimates and Monte-Carlo return estimates and provide a sharp analysis showing that they can achieve optimal risk up to logarithmic factors. Additionally, we show that temporal difference is unable to efficiently utilise information from all available devices under the single-round communication setting, due to the initial bias of this method. To the best of our knowledge, this paper presents the first minimax lower bounds for distributed offline RL problems.

URL: https://openreview.net/forum?id=155XNbeViS

---

Title: Towards Large Scale Transfer Learning for Differentially Private Image Classification

Abstract: Differentially Private Stochastic Gradient Descent (DP-SGD) has emerged as a popular private training algorithm. Unfortunately, the computational cost of training large-scale models with DP-SGD is substantially higher than non-private training. This is further exacerbated by the fact that increasing the number of parameters leads to larger degradation in utility with DP.
In this work, we zoom in on the ImageNet dataset and demonstrate that, similar to the non-private case, pre-training over-parameterized models on a large public dataset can lead to substantial gains when the models are finetuned privately. Moreover, by systematically comparing private and non-private models across a range of large batch sizes, we find that, as in the non-private setting, the choice of optimizer can further improve performance substantially with DP. Using the LAMB optimizer, we see improvements of up to 20 percentage points (absolute). We also show that finetuning just the last layer for a \emph{single step} in the full-batch setting, combined with extremely small-scale (near-zero) initialization, leads to SOTA results of 81.7$\%$ under a wide privacy budget range of $\epsilon \in [4, 10]$ and $\delta = 10^{-6}$, while substantially reducing the computational overhead. Finally, we present additional results on CIFAR-10 and CIFAR-100, surpassing the previous state of the art by leveraging transfer learning with our recommendations.
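
A minimal sketch of the last-layer recipe (ours, not the authors' code; hyperparameters are placeholders and privacy accounting is omitted):

    import torch

    def dp_last_layer_step(features, labels, num_classes, clip=1.0, sigma=0.5, lr=1.0):
        # One full-batch DP-SGD step on a near-zero-initialized last layer
        # over frozen pre-trained features (here initialized exactly to zero).
        d = features.shape[1]
        W = torch.zeros(num_classes, d)
        grads = []
        for x, y in zip(features, labels):
            w = W.clone().requires_grad_(True)
            loss = torch.nn.functional.cross_entropy((w @ x).unsqueeze(0), y.unsqueeze(0))
            g, = torch.autograd.grad(loss, w)
            g = g * torch.clamp(clip / (g.norm() + 1e-12), max=1.0)  # per-example clipping
            grads.append(g)
        noisy = torch.stack(grads).sum(0) + sigma * clip * torch.randn_like(W)
        return W - lr * noisy / len(features)                        # single full-batch update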

URL: https://openreview.net/forum?id=Uu8WwCFpQv

---

Title: Reward-Predictive Clustering

Abstract: Recent advances in reinforcement-learning research have demonstrated impressive results in building algorithms that can out-perform humans in complex tasks. Nevertheless, creating reinforcement-learning systems that can build abstractions of their experience to accelerate learning in new contexts still remains an active area of research. Previous work showed that reward-predictive state abstractions fulfill this goal, but they have only been applied to tabular settings. Here, we provide a clustering algorithm that enables the application of such state abstractions to deep learning settings, providing compressed representations of an agent's inputs that preserve the ability to predict sequences of reward. A convergence theorem and simulations show that the resulting reward-predictive deep network maximally compresses the agent's inputs, significantly speeding up learning in high-dimensional visual control tasks. Furthermore, we present different generalization experiments and analyze under which conditions a pre-trained reward-predictive representation network can be re-used without re-training to accelerate learning---a form of systematic out-of-distribution transfer.

URL: https://openreview.net/forum?id=1GEYhxqIlJ

---

Title: Clustering using Approximate Nearest Neighbour Oracles

Abstract: We study the problem of clustering data points in a streaming setting when one has access to the geometry of the space only via approximate nearest neighbour (ANN) oracles. In this setting, we present algorithms for streaming $O(1)$-approximate $k$-median clustering and its (streaming) coreset construction. In many domains of interest, our algorithms improve upon the best-known runtimes for both of these problems. Furthermore, our results extend to cost functions satisfying the approximate triangle inequality, which subsumes $k$-means clustering and $M$-estimators. Finally, we run experiments on the Census1990 dataset, whose results empirically support our theory.

URL: https://openreview.net/forum?id=TzRXyO3CzX

---

Title: Towards Better Out-of-Distribution Generalization of Neural Algorithmic Reasoning Tasks

Abstract: In this paper, we study the OOD generalization of neural algorithmic reasoning tasks, where the goal is to learn an algorithm (e.g., sorting, breadth-first search, or depth-first search) from input-output pairs using deep neural networks. First, we argue that OOD generalization in this setting differs significantly from common OOD settings. For example, some phenomena observed in the OOD generalization of image classification, such as \emph{accuracy on the line}, are not observed here, and techniques such as data augmentation do not help, as the assumptions underlying many augmentation techniques are often violated. Second, we analyze the main challenges (e.g., input distribution shift, non-representative data generation, and uninformative validation metrics) of the current leading benchmark, i.e., CLRS \citep{deepmind2021clrs}, which contains 30 algorithmic reasoning tasks. We propose several solutions, including a simple-yet-effective fix to the input distribution shift and improved data generation. Finally, we propose an attention-based 2WL-graph neural network (GNN) processor which complements message-passing GNNs, so that their combination outperforms the state-of-the-art model by a $3\%$ margin averaged over all algorithms.

URL: https://openreview.net/forum?id=xkrtvHlp3P

---

Title: Contrastive Search Is What You Need For Neural Text Generation

Abstract: Generating text with autoregressive language models (LMs) is of great importance to many natural language processing (NLP) applications. Previous solutions for this task often produce text that contains degenerative expressions (Welleck et al., 2020) or lacks semantic consistency (Basu et al., 2021). Recently, Su et al. (2022b) introduced a new decoding method, contrastive search, based on the isotropic representation space of the language model and obtained a new state of the art on various benchmarks. Additionally, Su et al. (2022b) argued that the representations of autoregressive LMs (e.g., GPT-2) are intrinsically anisotropic, a view also shared by previous studies (Ethayarajh, 2019). Therefore, to ensure the language model follows an isotropic distribution, Su et al. (2022b) proposed a contrastive learning scheme, SimCTG, which calibrates the language model’s representations through additional training.

In this study, we first answer the question: “Are autoregressive LMs really anisotropic?”. To this end, we extensively evaluate the isotropy of LMs across 16 major languages. Surprisingly, we find that the anisotropy problem only exists in the two specific English GPT-2-small/medium models. On the other hand, all other evaluated LMs are naturally isotropic, in contrast to the conclusions drawn by previous studies (Ethayarajh, 2019; Su et al., 2022b). Based on our findings, we further assess the contrastive search decoding method using off-the-shelf LMs on four generation tasks across 16 languages. Our experimental results demonstrate that contrastive search significantly outperforms previous decoding methods without any additional training. More notably, on 12 of the 16 evaluated languages, contrastive search performs comparably to human-level performance as judged by human evaluations.
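
A minimal sketch of one contrastive-search decoding step (our simplification; in practice the candidate representations come from running the model one step ahead per candidate):

    import torch
    import torch.nn.functional as F

    def contrastive_step(probs, cand_hidden, prev_hidden, k=5, alpha=0.6):
        # probs: (vocab,) next-token distribution; cand_hidden: (vocab, d)
        # candidate token representations; prev_hidden: (t, d) prefix states.
        topk_p, topk_ids = probs.topk(k)
        h = F.normalize(cand_hidden[topk_ids], dim=-1)
        hp = F.normalize(prev_hidden, dim=-1)
        degeneration = (h @ hp.T).max(dim=-1).values   # max similarity to the prefix
        score = (1 - alpha) * topk_p - alpha * degeneration
        return topk_ids[score.argmax()]                # confidence vs. degeneration trade-off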

URL: https://openreview.net/forum?id=GbkWw3jwL9

---

Title: Learning Representations for Pixel-based Control: What Matters and Why?

Abstract: Learning representations for pixel-based control has garnered significant attention recently in reinforcement learning. A wide range of methods have been proposed to enable efficient learning, leading to sample complexities similar to those in the full state setting. However, moving beyond carefully curated pixel data sets (centered crop, appropriate lighting, clear background, etc.) remains challenging. In this paper, we adopt a more difficult setting, incorporating background distractors, as a first step towards addressing this challenge. We start by exploring a simple baseline approach that does not use metric-based learning, data augmentations, world-model learning, or contrastive learning. We then analyze when and why previously proposed methods are likely to fail or reduce to the same performance as the baseline in this harder setting and why we should think carefully about extending such methods beyond the well curated environments. Our results show that finer categorization of benchmarks on the basis of characteristics like density of reward, planning horizon of the problem, presence of task-irrelevant components, etc., is crucial in evaluating algorithms. Based on these observations, we propose different metrics to consider when evaluating an algorithm on benchmark tasks. We hope such a data-centric view can motivate researchers to rethink representation learning when investigating how to best apply RL to real-world tasks.

URL: https://openreview.net/forum?id=wIXHG8LZ2w

---

Title: Blind Sequence Denoising with Self-Supervised Set Learning

Abstract: Denoising discrete-valued sequences typically relies on training a supervised model on ground-truth sources or fitting a statistical model of a noisy channel. Biological sequence analysis presents a unique challenge for both approaches, as obtaining ground-truth sequences is resource-intensive and the complexity of sequencing errors makes it difficult to specify an accurate noise model. Recent developments in DNA sequencing have opened an avenue for tackling this problem by producing long DNA reads consisting of multiple subreads, or noisy observations of the same sequence, that can be denoised together. Inspired by this context, we propose a novel method for denoising sets of sequences that does not require access to clean sources. Our method, Self-Supervised Set Learning (SSSL), gathers subreads together in an embedding space and estimates a single set embedding as the midpoint of the subreads in both the latent space and sequence space. This set embedding represents the “average” of the subreads and can be decoded into a prediction of the clean sequence. In experiments on simulated long-read DNA data, SSSL-denoised sequences contain 31% fewer errors compared to a traditional denoising algorithm based on a multi-sequence alignment (MSA) of the subreads. When very few subreads are available or high error rates lead to poor alignment, SSSL reduces errors by an even greater margin. On an experimental dataset of antibody sequences, SSSL improves over the MSA-based algorithm on two proposed self-supervised metrics, with a significant difference on difficult reads with fewer than ten subreads, which comprise over 75% of the test set. SSSL promises to better realize the potential of high-throughput DNA sequencing data.
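
Schematically (a simplified sketch; `encoder` and `decoder` stand in for the learned sequence models, and the paper's midpoint estimate also involves the sequence space, not just the latent mean used here):

    import torch

    def denoise_subreads(encoder, decoder, subreads):
        z = torch.stack([encoder(s) for s in subreads])  # embed each noisy subread
        z_set = z.mean(dim=0)                            # set embedding near the latent midpoint
        return decoder(z_set)                            # decode a clean-sequence estimate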

URL: https://openreview.net/forum?id=JHTnT9vezG

---

Title: Workflow Discovery from Dialogues in the Low Data Regime

Abstract: Text-based dialogues are now widely used to solve real-world problems. In cases where solution strategies are already known, they can sometimes be codified into workflows and used to guide humans or artificial agents through the task of helping clients. We introduce a new problem formulation that we call Workflow Discovery (WD), in which we are interested in the situation where a formal workflow may not yet exist; still, we wish to discover the set of actions that have been taken to resolve a particular problem. We also examine a sequence-to-sequence (Seq2Seq) approach for this novel task. We present experiments in which we extract workflows from dialogues in the Action-Based Conversations Dataset (ABCD). Since the ABCD dialogues follow known workflows to guide agents, we can evaluate our ability to extract such workflows using ground-truth sequences of actions. We propose and evaluate an approach that conditions models on the set of possible actions, and we show that using this strategy we can improve WD performance. Our conditioning approach also improves zero-shot and few-shot WD performance when transferring learned models to unseen domains within and across datasets. Further, a modified variant of our Seq2Seq method achieves state-of-the-art performance on the related but distinct problems of Action State Tracking (AST) and Cascading Dialogue Success (CDS) on the ABCD dataset.
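
A sketch of the action-conditioned seq2seq formulation (the exact prompt format is our guess, not necessarily the paper's):

    def make_wd_example(dialogue_turns, possible_actions, gold_actions):
        # Condition the model on the set of possible actions by appending
        # them to the source; the target is the ordered action sequence.
        source = " ".join(dialogue_turns) + " | actions: " + "; ".join(possible_actions)
        target = "; ".join(gold_actions)
        return source, target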

URL: https://openreview.net/forum?id=L9othQvPks

---

Title: $k$-Mixup Regularization for Deep Learning via Optimal Transport

Abstract: Mixup is a popular regularization technique for training deep neural networks that improves generalization and increases adversarial robustness. It perturbs input training data in the direction of other randomly chosen instances in the training set. To better leverage the structure of the data, we extend mixup in a simple, broadly applicable way to $k$-mixup, which perturbs $k$-batches of training points in the direction of other $k$-batches. The perturbation is done with displacement interpolation, i.e., interpolation under the Wasserstein metric. We demonstrate theoretically and in simulations that $k$-mixup preserves cluster and manifold structures, and we extend the theory studying the efficacy of standard mixup to the $k$-mixup case. Our empirical results show that training with $k$-mixup further improves generalization and robustness across several network architectures and benchmark datasets of differing modalities. The gains of $k$-mixup over standard mixup are generally comparable to those of mixup itself over standard ERM.
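
A minimal sketch of the $k$-batch matching step (ours; labels are assumed one-hot so they can be interpolated):

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def k_mixup(x1, y1, x2, y2, lam):
        # Optimal matching of the two k-batches under squared-Euclidean cost,
        # then interpolation of matched pairs (displacement interpolation),
        # instead of mixing uniformly random pairs as in standard mixup.
        cost = ((x1[:, None] - x2[None]) ** 2).sum(-1)
        rows, cols = linear_sum_assignment(cost)
        x = lam * x1[rows] + (1 - lam) * x2[cols]
        y = lam * y1[rows] + (1 - lam) * y2[cols]
        return x, y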

URL: https://openreview.net/forum?id=4Cx6GHd99J

---
