Weekly TMLR digest for Nov 20, 2022


TMLR

Nov 19, 2022, 7:00:11 PM
to tmlr-annou...@googlegroups.com


New certifications
==================

Survey Certification: Action Noise in Off-Policy Deep Reinforcement Learning: Impact on Exploration and Performance

Jakob Hollenstein, Sayantan Auddy, Matteo Saveriano, Erwan Renaudo, Justus Piater

https://openreview.net/forum?id=NljBlZ6hmG

---


Accepted papers
===============


Title: Complex-Valued Autoencoders for Object Discovery

Authors: Sindy Löwe, Phillip Lippe, Maja Rudolph, Max Welling

Abstract: Object-centric representations form the basis of human perception, and enable us to reason about the world and to systematically generalize to new settings. Currently, most works on unsupervised object discovery focus on slot-based approaches, which explicitly separate the latent representations of individual objects. While the result is easily interpretable, it usually requires the design of involved architectures. In contrast to this, we propose a comparatively simple approach – the Complex AutoEncoder (CAE) – that creates distributed object-centric representations. Following a coding scheme theorized to underlie object representations in biological neurons, its complex-valued activations represent two messages: their magnitudes express the presence of a feature, while the relative phase differences between neurons express which features should be bound together to create joint object representations. In contrast to previous approaches using complex-valued activations for object discovery, we present a fully unsupervised approach that is trained end-to-end – resulting in significant improvements in performance and efficiency. Further, we show that the CAE achieves competitive or better unsupervised object discovery performance on simple multi-object datasets compared to a state-of-the-art slot-based approach while being up to 100 times faster to train.
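
As a loose illustration of the coding scheme described above (a toy NumPy sketch, not the CAE architecture itself): magnitudes encode feature presence and relative phases encode grouping, so object groups can be read off by clustering phases.

    import numpy as np
    from sklearn.cluster import KMeans

    # Two synthetic "objects": features of the same object share a phase.
    rng = np.random.default_rng(0)
    phases = np.concatenate([rng.normal(0.5, 0.05, 50), rng.normal(2.5, 0.05, 50)])
    magnitudes = rng.uniform(0.5, 1.0, size=100)
    z = magnitudes * np.exp(1j * phases)            # complex-valued activations

    # Cluster unit-circle embeddings of the phases to recover the two groups.
    xy = np.stack([np.cos(np.angle(z)), np.sin(np.angle(z))], axis=1)
    groups = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(xy)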

URL: https://openreview.net/forum?id=1PfcmFTXoa

---

Title: Learning Algorithms for Markovian Bandits: Is Posterior Sampling more Scalable than Optimism?

Authors: Nicolas Gast, Bruno Gaujal, Kimang Khun

Abstract: In this paper, we study the scalability of model-based algorithms learning the optimal policy of a discounted rested Markovian bandit problem with $n$ arms. There are two categories of model-based reinforcement learning algorithms: Bayesian algorithms (like PSRL), and optimistic algorithms (like UCRL2 or UCBVI). A naive application of these algorithms is not scalable because the state-space is exponential in $n$. In this paper, we construct variants of these algorithms specially tailored to Markovian bandits (MB) that we call MB-PSRL, MB-UCRL2, and MB-UCBVI. We consider an episodic setting with geometrically distributed episode length, and measure the performance of the algorithm in terms of regret (Bayesian regret for MB-PSRL and expected regret for MB-UCRL2 and MB-UCBVI). We prove that, for this setting, all algorithms have a low regret in $\tilde{O}(S\sqrt{nK})$ -- where $K$ is the number of episodes, $n$ is the number of arms and $S$ is the number of states of each arm. Up to a factor $\sqrt{S}$, these regrets match the Bayesian minimax regret lower bound of $\Omega(\sqrt{SnK})$ that we also derive.

Even if their theoretical regrets are comparable, the time complexities of these algorithms vary greatly: We show that MB-UCRL2, as well as all algorithms that use bonuses on transition matrices, have a time complexity that grows exponentially in $n$. In contrast, MB-UCBVI does not use bonuses on transition matrices and we show that it can be implemented efficiently, with a time complexity linear in $n$. Our numerical experiments show, however, that its empirical regret is large. Our Bayesian algorithm, MB-PSRL, enjoys the best of both worlds: its running time is linear in the number of arms and its empirical regret is the smallest of all algorithms. This is a new addition to our understanding of the power of Bayesian algorithms, which can often be tailored to the structure of the problem to be learned.

URL: https://openreview.net/forum?id=Sh3RF9JowK

---

Title: Modeling Object Dissimilarity for Deep Saliency Prediction

Authors: Bahar Aydemir, Deblina Bhattacharjee, Tong Zhang, Seungryong Kim, Mathieu Salzmann, Sabine Süsstrunk

Abstract: Saliency prediction has made great strides over the past two decades, with current techniques modeling low-level information, such as color, intensity and size contrasts, and high-level ones, such as attention and gaze direction for entire objects. Despite this, these methods fail to account for the dissimilarity between objects, which affects human visual attention. In this paper, we introduce a detection-guided saliency prediction network that explicitly models the differences between multiple objects, such as their appearance and size dissimilarities. Our approach allows us to fuse our object dissimilarities with features extracted by any deep saliency prediction network. As evidenced by our experiments, this consistently boosts the accuracy of the baseline networks, enabling us to outperform the state-of-the-art models on three saliency benchmarks, namely SALICON, MIT300 and CAT2000. Our project page is at https://github.com/IVRL/DisSal.

URL: https://openreview.net/forum?id=NmTMc3uD1G

---

Title: Optimizing Intermediate Representations of Generative Models for Phase Retrieval

Authors: Tobias Uelwer, Sebastian Konietzny, Stefan Harmeling

Abstract: Phase retrieval is the problem of reconstructing images from magnitude-only measurements. In many real-world applications the problem is underdetermined. When training data is available, generative models allow optimization in a lower-dimensional latent space, hereby constraining the solution set to those images that can be synthesized by the generative model. However, not all possible solutions are within the range of the generator. Instead, they are represented with some error. To reduce this representation error in the context of phase retrieval, we first leverage a novel variation of intermediate layer optimization (ILO) to extend the range of the generator while still producing images consistent with the training data. Second, we introduce new initialization schemes that further improve the quality of the reconstruction. With extensive experiments on the Fourier phase retrieval problem and thorough ablation studies, we can show the benefits of our modified ILO and the new initialization schemes. Additionally, we analyze the performance of our approach on the Gaussian phase retrieval problem.

URL: https://openreview.net/forum?id=YAVE6jfeJb

---

Title: Algorithms and Theory for Supervised Gradual Domain Adaptation

Authors: Jing Dong, Shiji Zhou, Baoxiang Wang, Han Zhao

Abstract: The phenomenon of data distributions evolving over time has been observed in a range of applications, calling for adaptive learning algorithms. We thus study the problem of supervised gradual domain adaptation, where labeled data from shifting distributions are available to the learner along the trajectory, and we aim to learn a classifier on a target data distribution of interest. Under this setting, we provide the first generalization upper bound on the learning error under mild assumptions. Our results are algorithm agnostic, general for a range of loss functions, and only depend linearly on the averaged learning error across the trajectory. This shows significant improvement compared to the previous upper bound for unsupervised gradual domain adaptation, where the learning error on the target domain depends exponentially on the initial error on the source domain. Compared with the offline setting of learning from multiple domains, our results also suggest the potential benefits of the temporal structure among different domains in adapting to the target one. Empirically, our theoretical results imply that learning proper representations across the domains will effectively mitigate the learning errors. Motivated by these theoretical insights, we propose a min-max learning objective to learn the representation and classifier simultaneously. Experimental results on both semi-synthetic and large-scale real datasets corroborate our findings and demonstrate the effectiveness of our objectives.

URL: https://openreview.net/forum?id=35y5hv9fbb

---

Title: Teacher’s pet: understanding and mitigating biases in distillation

Authors: Michal Lukasik, Srinadh Bhojanapalli, Aditya Krishna Menon, Sanjiv Kumar

Abstract: Knowledge distillation is widely used as a means of improving the performance of a relatively simple ``student'' model using the predictions from a complex ``teacher'' model. Several works have shown that distillation significantly boosts the student's \emph{overall} performance; however, are these gains uniform across all data subgroups? In this paper, we show that distillation can \emph{harm} performance on certain subgroups, e.g., classes with few associated samples, compared to the vanilla student trained using the one-hot labels. We trace this behaviour to errors made by the teacher distribution being transferred to and \emph{amplified} by the student model, and formally prove that distillation can indeed harm underrepresented subgroups in certain regression settings. To mitigate this problem, we present techniques which soften the teacher influence for subgroups where it is less reliable. Experiments on several image classification benchmarks show that these modifications of distillation maintain the boost in overall accuracy, while additionally ensuring improvement in subgroup performance.
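
For context, a generic form of the distillation objective with a per-example weight on the teacher term looks like the sketch below (this is not the authors' method; down-weighting the teacher term for examples from unreliable subgroups is just one generic way to "soften the teacher influence").

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, alpha, T=2.0):
        # alpha: per-example weight in [0, 1] on the teacher (KD) term.
        ce = F.cross_entropy(student_logits, labels, reduction="none")
        kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                      F.softmax(teacher_logits / T, dim=-1),
                      reduction="none").sum(-1) * (T * T)
        return ((1 - alpha) * ce + alpha * kd).mean()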

URL: https://openreview.net/forum?id=ph3AYXpwEb

---

Title: Action Noise in Off-Policy Deep Reinforcement Learning: Impact on Exploration and Performance

Authors: Jakob Hollenstein, Sayantan Auddy, Matteo Saveriano, Erwan Renaudo, Justus Piater

Abstract: Many Deep Reinforcement Learning (D-RL) algorithms rely on simple forms of exploration such as the additive action noise often used in continuous control domains. Typically, the scaling factor of this action noise is chosen as a hyper-parameter and is kept constant during training. In this paper, we focus on action noise in off-policy deep reinforcement learning for continuous control. We analyze how the learned policy is impacted by the noise type, noise scale, and impact scaling factor reduction schedule. We consider the two most prominent types of action noise, Gaussian and Ornstein-Uhlenbeck noise, and perform a vast experimental campaign by systematically varying the noise type and scale parameter, and by measuring variables of interest like the expected return of the policy and the state-space coverage during exploration. For the latter, we propose a novel state-space coverage measure $\operatorname{X}_{\mathcal{U}\text{rel}}$ that is more robust to estimation artifacts caused by points close to the state-space boundary than previously-proposed measures. Larger noise scales generally increase state-space coverage. However, we found that increasing the space coverage using a larger noise scale is often not beneficial. On the contrary, reducing the noise scale over the training process reduces the variance and generally improves the learning performance. We conclude that the best noise type and scale are environment dependent, and based on our observations derive heuristic rules for guiding the choice of the action noise as a starting point for further optimization.
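
For readers unfamiliar with the two noise types, here is a minimal sketch (illustrative only, with hypothetical hyper-parameter values) of Gaussian and Ornstein-Uhlenbeck action noise plus a simple linear scale-reduction schedule.

    import numpy as np

    def gaussian_noise(shape, scale, rng):
        # i.i.d. Gaussian action noise
        return scale * rng.standard_normal(shape)

    class OUNoise:
        # Ornstein-Uhlenbeck process: temporally correlated noise
        def __init__(self, dim, scale, theta=0.15, dt=1e-2, rng=None):
            self.theta, self.dt, self.scale = theta, dt, scale
            self.rng = rng or np.random.default_rng()
            self.state = np.zeros(dim)

        def sample(self):
            drift = -self.theta * self.state * self.dt
            diffusion = self.scale * np.sqrt(self.dt) * self.rng.standard_normal(self.state.shape)
            self.state = self.state + drift + diffusion
            return self.state

    def linear_scale(step, total_steps, start=0.3, end=0.0):
        # one possible noise-scale reduction schedule
        frac = min(step / total_steps, 1.0)
        return start + frac * (end - start)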


URL: https://openreview.net/forum?id=NljBlZ6hmG

---

Title: Unifying Approaches in Active Learning and Active Sampling via Fisher Information and Information-Theoretic Quantities

Authors: Andreas Kirsch, Yarin Gal

Abstract: Recently proposed methods in data subset selection, that is active learning and active sampling, use Fisher information, Hessians, similarity matrices based on gradients, and gradient lengths to estimate how informative data is for a model’s training. Are these different approaches connected, and if so, how? We revisit the fundamentals of Bayesian optimal experiment design and show that these recently proposed methods can be understood as approximations to information-theoretic quantities: among them, the mutual information between predictions and model parameters, known as expected information gain or BALD in machine learning, and the mutual information between predictions of acquisition candidates and test samples, known as expected predictive information gain. We develop a comprehensive set of approximations using Fisher information and observed information and derive a unified framework that connects seemingly disparate literature. Although Bayesian methods are often seen as separate from non-Bayesian ones, the sometimes fuzzy notion of “informativeness” expressed in various non-Bayesian objectives leads to the same couple of information quantities, which were, in principle, already known by Lindley (1956) and MacKay (1992).
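
The BALD / expected-information-gain quantity mentioned above has a standard Monte-Carlo form; a small sketch, assuming predictive probabilities obtained from S posterior samples (e.g., MC dropout):

    import numpy as np

    def bald_scores(probs):
        # probs: array of shape (S, N, C) -- S posterior samples, N candidates, C classes.
        # Returns the mutual information between predictions and parameters per candidate.
        mean_p = probs.mean(axis=0)                                       # predictive distribution
        entropy_mean = -(mean_p * np.log(mean_p + 1e-12)).sum(-1)         # H[y | x, D]
        mean_entropy = -(probs * np.log(probs + 1e-12)).sum(-1).mean(0)   # E_theta H[y | x, theta]
        return entropy_mean - mean_entropy                                # expected information gain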

URL: https://openreview.net/forum?id=UVDAKQANOW

---

Title: An Efficient One-Class SVM for Novelty Detection in IoT

Authors: Kun Yang, Samory Kpotufe, Nicholas Feamster

Abstract: One-Class Support Vector Machines (OCSVM) are a common approach for novelty detection, due to their flexibility in fitting complex nonlinear boundaries between {normal} and {novel} data. Novelty detection is important in the Internet of Things (``IoT'') due to the threats these devices can present, and OCSVM often performs well in these environments due to the variety of devices, traffic patterns, and anomalies that IoT devices present. Unfortunately, conventional OCSVMs can introduce prohibitive memory and computational overhead at detection time. This work designs, implements and evaluates an efficient OCSVM for such practical settings. We extend Nystr\"om and (Gaussian) Sketching approaches to OCSVM, combining these methods with clustering and Gaussian mixture models to achieve 15-30x speedup in prediction time and 30-40x reduction in memory requirements without sacrificing detection accuracy. Here, the very nature of IoT devices is crucial: they tend to admit few modes of \emph{normal} operation, allowing for efficient pattern compression.
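
The general recipe of pairing a kernel approximation with a linear one-class SVM can be sketched with scikit-learn (version 1.0 or later); this is a generic illustration of the speed/memory trade-off, not the authors' implementation.

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.kernel_approximation import Nystroem
    from sklearn.linear_model import SGDOneClassSVM

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(5000, 20))        # stand-in for benign IoT traffic features

    # Nystroem approximates the RBF feature map with a small set of landmarks,
    # so the one-class SVM can be solved in the primal with linear cost at test time.
    model = make_pipeline(
        Nystroem(kernel="rbf", gamma=0.1, n_components=100, random_state=0),
        SGDOneClassSVM(nu=0.05, random_state=0),
    )
    model.fit(X_train)
    scores = model.decision_function(X_train)    # higher = more "normal"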

URL: https://openreview.net/forum?id=LFkRUCalFt

---

Title: Competition over data: how does data purchase affect users?

Authors: Yongchan Kwon, Tony A Ginart, James Zou

Abstract: As the competition among machine learning (ML) predictors is widespread in practice, it becomes increasingly important to understand the impact and biases arising from such competition. One critical aspect of ML competition is that ML predictors are constantly updated by acquiring additional data during the competition. Although this active data acquisition can largely affect the overall competition environment, it has not been well-studied before. In this paper, we study what happens when ML predictors can purchase additional data during the competition. We introduce a new environment in which ML predictors use active learning algorithms to effectively acquire labeled data within their budgets while competing against each other. We empirically show that the overall performance of an ML predictor improves when predictors can purchase additional labeled data. Surprisingly, however, the quality that users experience---i.e., the accuracy of the predictor selected by each user---can decrease even as the individual predictors get better. We demonstrate that this phenomenon naturally arises due to a trade-off whereby competition pushes each predictor to specialize in a subset of the population while data purchase has the effect of making predictors more uniform. With comprehensive experiments, we show that our findings are robust against different modeling assumptions.

URL: https://openreview.net/forum?id=63sJsCmq6Q

---

Title: Diffusion Models for Video Prediction and Infilling

Authors: Tobias Höppe, Arash Mehrjou, Stefan Bauer, Didrik Nielsen, Andrea Dittadi

Abstract: Predicting and anticipating future outcomes or reasoning about missing information in a sequence are critical skills for agents to be able to make intelligent decisions. This requires strong, temporally coherent generative capabilities. Diffusion models have shown remarkable success in several generative tasks, but have not been extensively explored in the video domain.
We present Random-Mask Video Diffusion (RaMViD), which extends image diffusion models to videos using 3D convolutions, and introduces a new conditioning technique during training.
By varying the mask we condition on, the model is able to perform video prediction, infilling, and upsampling. Due to our simple conditioning scheme, we can utilize the same architecture as used for unconditional training, which allows us to train the model in a conditional and unconditional fashion at the same time. We evaluate RaMViD on two benchmark datasets for video prediction, on which we achieve state-of-the-art results, and one for video generation. High-resolution videos are provided at https://sites.google.com/view/video-diffusion-prediction.

URL: https://openreview.net/forum?id=lf0lr4AYM6

---

Title: Efficient Gradient Flows in Sliced-Wasserstein Space

Authors: Clément Bonet, Nicolas Courty, François Septier, Lucas Drumetz

Abstract: Minimizing functionals in the space of probability distributions can be done with Wasserstein gradient flows. To solve them numerically, a possible approach is to rely on the Jordan–Kinderlehrer–Otto (JKO) scheme which is analogous to the proximal scheme in Euclidean spaces. However, it requires solving a nested optimization problem at each iteration, and is known for its computational challenges, especially in high dimension. To alleviate it, very recent works propose to approximate the JKO scheme leveraging Brenier's theorem, and using gradients of Input Convex Neural Networks to parameterize the density (JKO-ICNN). However, this method comes with a high computational cost and stability issues. Instead, this work proposes to use gradient flows in the space of probability measures endowed with the sliced-Wasserstein (SW) distance. We argue that this method is more flexible than JKO-ICNN, since SW enjoys a closed-form differentiable approximation. Thus, the density at each step can be parameterized by any generative model which alleviates the computational burden and makes it tractable in higher dimensions.
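
The closed-form nature of SW comes from the fact that one-dimensional optimal transport reduces to sorting; a minimal Monte-Carlo estimator of the sliced Wasserstein distance (a NumPy sketch, not the authors' flow code):

    import numpy as np

    def sliced_wasserstein(X, Y, n_projections=200, p=2, rng=None):
        # Monte-Carlo estimate of the sliced p-Wasserstein distance between two
        # point clouds X, Y of shape (n, d) with equal sample sizes.
        rng = rng or np.random.default_rng()
        d = X.shape[1]
        theta = rng.standard_normal((n_projections, d))
        theta /= np.linalg.norm(theta, axis=1, keepdims=True)   # random unit directions
        X_proj = np.sort(X @ theta.T, axis=0)                   # 1-D OT = sort and match
        Y_proj = np.sort(Y @ theta.T, axis=0)
        return (np.mean(np.abs(X_proj - Y_proj) ** p)) ** (1.0 / p)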

URL: https://openreview.net/forum?id=Au1LNKmRvh

---


New submissions
===============


Title: SPADE: Semi-supervised Anomaly Detection under Distribution Mismatch

Abstract: Semi-supervised anomaly detection is a common problem, as often the datasets containing anomalies are partially labeled. We propose a canonical framework, Semi-supervised Pseudo-labeler Anomaly Detection with Ensembling (SPADE), that is not limited by the assumption that labeled and unlabeled data come from the same distribution. Indeed, this assumption is often violated in practice - for example, the labeled data may contain only anomalies unlike the unlabeled data, the unlabeled data may contain different types of anomalies, or the labeled data may contain only `easy-to-label' samples. SPADE utilizes an ensemble of one-class classifiers as the pseudo-labeler to improve the robustness of pseudo-labeling under distribution mismatch. Partial matching is proposed to automatically select the critical hyper-parameters for pseudo-labeling without validation data, which is crucial with limited labeled data. SPADE shows state-of-the-art semi-supervised anomaly detection performance across a wide range of scenarios with distribution mismatch in both tabular and image domains. In some common real-world settings, such as a model facing new types of unlabeled anomalies, SPADE outperforms the state-of-the-art alternatives by 5% AUC on average.

URL: https://openreview.net/forum?id=JwDpZSv3yz

---

Title: Language Models Can See: Plugging Visual Controls in Text Generation

Abstract: Generative language models (LMs) such as GPT-2/3 can be prompted to generate text with remarkable quality. While they are designed for text-prompted generation, it remains an open question how the generation process could be guided by modalities beyond text such as images. In this work, we propose a training-free framework, called MAGIC (iMAge-Guided text generatIon with CLIP), for plugging in visual controls in the generation process and enabling LMs to perform multimodal tasks (e.g., image captioning) in a zero-shot manner.
MAGIC is a simple yet efficient plug-and-play framework, which directly combines an off-the-shelf LM (i.e., GPT-2) and an image-text matching model (i.e., CLIP) for image-grounded text generation. During decoding, MAGIC influences the generation of the LM by introducing a CLIP-induced score, namely magic score, which regularizes the generated result to be semantically related to a given image while being coherent to the previously generated context. Notably, the proposed decoding scheme does not involve any gradient update operation, therefore being computationally efficient. On the challenging task of zero-shot image captioning, MAGIC outperforms the state-of-the-art method by notable margins with a nearly 27 times decoding speedup. MAGIC is a flexible framework and is theoretically compatible with any text generation tasks that incorporate image grounding. In the experiments, we showcase that it is also capable of performing visually grounded story generation given both an image and a text prompt.

URL: https://openreview.net/forum?id=dtRZmxualH

---

Title: A Unified View of Masked Image Modeling

Abstract: Masked image modeling has demonstrated great potential to eliminate the label-hungry problem of training large-scale vision Transformers, achieving impressive performance on various downstream tasks. In this work, we propose a unified view of masked image modeling after revisiting existing methods. Under the unified view, we introduce a simple yet effective method, termed MaskDistill, which reconstructs normalized semantic features from teacher models at the masked positions, conditioning on corrupted input images. Experimental results on image classification and semantic segmentation show that MaskDistill achieves performance comparable or superior to state-of-the-art methods. When using the huge vision Transformer and pretraining 300 epochs, MaskDistill obtains 88.3% fine-tuning top-1 accuracy on ImageNet-1k (224 size) and 58.8 mIoU for semantic segmentation on ADE20k (512 size). Code is enclosed in the supplementary materials.

URL: https://openreview.net/forum?id=wmGlMhaBe0

---

Title: Risk Sensitive and Robust Dead-end Identification in Safety-Critical Offline Reinforcement Learning

Abstract: In safety-critical decision-making scenarios, being able to identify worst-case outcomes, or dead-ends, is crucial in order to develop safe and reliable policies in practice. These situations are typically rife with uncertainty due to unknown or stochastic characteristics of the environment as well as limited offline training data. As a result, the value of a decision at any time point should be based on the distribution of its anticipated effects. We propose a framework to identify worst-case decision points by explicitly estimating distributions of the expected return of a decision. These estimates enable earlier indication of dead-ends in a manner that is tunable based on the risk tolerance of the designed task. We demonstrate the utility of Distributional Dead-end Discovery (DistDeD) in a toy domain as well as when assessing the risk of severely ill patients in the intensive care unit reaching a point where death is unavoidable. We find that DistDeD significantly improves over prior discovery approaches, providing indications of the risk 10 hours earlier on average as well as increasing detection by 20%.

URL: https://openreview.net/forum?id=oKlEOT83gI

---

Title: Logical Tasks for Measuring Extrapolation and Rule Comprehension

Abstract: Logical reasoning is essential in a variety of human activities. A representative example of a logical task is mathematics. Recent large-scale models trained on large datasets have been successful in various fields, but their reasoning ability in arithmetic tasks is limited, which we reproduce experimentally. Here, we recast this limitation as not unique to mathematics but common to tasks that require logical operations. We then propose a new set of tasks, termed logical tasks, which will be the next challenge to address. This higher point of view helps the development of inductive biases that have broad impact beyond the solution of individual tasks. We define and characterize logical tasks and discuss system requirements for their solution. Furthermore, we discuss the relevance of logical tasks to concepts such as extrapolation, explainability, and inductive bias. Finally, we provide directions for solving logical tasks.

URL: https://openreview.net/forum?id=1H705s0cBQ

---

Title: Optimal Threshold Labeling for Ordinal Regression Methods

Abstract: For an ordinal regression task, a classification task for ordinal data, one-dimensional transformation (1DT)-based methods are often employed since they are considered to capture the ordinal relation of ordinal data well. They learn a 1DT of the observation of the explanatory variables so that an observation with a larger class label tends to have a larger value of the 1DT, and classify the observation by labeling that learned 1DT. In this paper, we study the labeling procedure for 1DT-based methods, which have not been sufficiently discussed in existing studies. While regression-based methods and classical threshold methods conventionally use threshold labelings, which label a learned 1DT according to the rank of the interval to which the 1DT belongs among intervals on the real line separated by threshold parameters, we prove that likelihood-based labeling used in popular statistical 1DT-based methods is also a threshold labeling in typical usages. Moreover, we show that these threshold labelings can be sub-optimal ones depending on the learning result of the 1DT and the task under consideration. On the basis of these findings, we propose to apply empirical optimal threshold labeling, which is a threshold labeling that uses threshold parameters minimizing the empirical task risk for a learned 1DT, to those methods. In experiments with real-world datasets, changing the labeling procedure of existing 1DT-based methods to the proposed one improved the classification performance in many tried cases.
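
Threshold labeling itself is easy to state in code; the following is a tiny illustrative sketch (a brute-force search over a user-supplied sorted grid, a hypothetical simplification of the empirical optimal threshold labeling discussed above, not the paper's procedure):

    import numpy as np
    from itertools import combinations

    def threshold_label(u, thresholds):
        # Assign ordinal labels 0..K-1 to 1DT values u, given K-1 sorted thresholds.
        return np.searchsorted(np.asarray(thresholds), u, side="right")

    def empirical_optimal_thresholds(u, y, n_classes, grid):
        # grid: sorted candidate threshold values; minimizes empirical zero-one risk.
        best, best_err = None, np.inf
        for t in combinations(grid, n_classes - 1):
            err = np.mean(threshold_label(u, t) != y)
            if err < best_err:
                best, best_err = np.array(t), err
        return best, best_err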

URL: https://openreview.net/forum?id=mHSAy1n65Z

---

Title: ES-ENAS: Efficient Evolutionary Optimization for Large-Scale Hybrid Search Spaces

Abstract: In this paper, we approach the problem of optimizing blackbox functions over large hybrid search spaces consisting of both combinatorial and continuous parameters. We demonstrate that previous evolutionary algorithms which rely on mutation-based approaches, while flexible over combinatorial spaces, suffer from a curse of dimensionality in high dimensional continuous spaces both theoretically and empirically, which thus limits their scope over hybrid search spaces as well. In order to combat this curse, we propose ES-ENAS, a simple and modular joint optimization procedure combining the class of sample-efficient smoothed gradient techniques, commonly known as Evolutionary Strategies (ES), with combinatorial optimizers in a highly scalable and intuitive way, inspired by the one-shot or supernet paradigm introduced in Efficient Neural Architecture Search (ENAS). By doing so, we achieve significantly greater sample efficiency, which we empirically demonstrate over synthetic benchmarks, and are further able to apply ES-ENAS for architecture search over popular RL benchmarks.
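
The "smoothed gradient" estimator underlying ES is standard; here is a minimal antithetic-sampling sketch for a flat parameter vector (generic, not the ES-ENAS code):

    import numpy as np

    def es_gradient(f, theta, sigma=0.1, n_pairs=32, rng=None):
        # Antithetic ES estimate of the gradient of the Gaussian-smoothed objective
        # E[f(theta + sigma * eps)] with respect to theta (theta is a 1-D array).
        rng = rng or np.random.default_rng()
        eps = rng.standard_normal((n_pairs, theta.size))
        gains = np.array([f(theta + sigma * e) - f(theta - sigma * e) for e in eps])
        return (eps.T @ gains) / (2.0 * sigma * n_pairs)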

URL: https://openreview.net/forum?id=EKtlJWam6h

---

Title: L-SVRG and L-Katyusha with Adaptive Sampling

Abstract: Stochastic gradient-based optimization methods, such as L-SVRG and its accelerated variant L-Katyusha (Kovalev et al., 2020), are widely used to train machine learning models. The theoretical and empirical performance of L-SVRG and L-Katyusha can be improved by sampling the observations from a non-uniform distribution (Qian et al., 2021). However, to design a desired sampling distribution, Qian et al. (2021) rely on prior knowledge of smoothness constants that can be computationally intractable to obtain in practice when the dimension of the model parameter is high. We propose an adaptive sampling strategy for L-SVRG and L-Katyusha that learns the sampling distribution with little computational overhead, while allowing it to change with iterates, and at the same time does not require any prior knowledge of the problem parameters. We prove convergence guarantees for L-SVRG and L-Katyusha for convex objectives when the sampling distribution changes with iterates. These results show that even without prior information, the proposed adaptive sampling strategy matches, and in some cases even surpasses, the performance of the sampling scheme in Qian et al. (2021). Extensive simulations support our theory and the practical utility of the proposed sampling scheme on real data.

URL: https://openreview.net/forum?id=9lyqt3rbDc

---

Title: lo-fi: distributed fine-tuning without communication

Abstract: When fine-tuning large neural networks, it is common to use multiple nodes and to communicate gradients at each optimization step. By contrast, we investigate completely local fine-tuning, which we refer to as lo-fi. During lo-fi, each node fine-tunes independently without any communication. Then, the weights are averaged across nodes at the conclusion of fine-tuning. When fine-tuning DeiT-base and DeiT-large on ImageNet, this procedure matches accuracy in-distribution and improves accuracy under distribution shift compared to the baseline, which observes the same amount of data but communicates gradients at each step. We also observe that lo-fi matches the baseline's performance when fine-tuning OPT language models (up to 1.3B parameters) on Common Crawl. By removing the communication requirement, lo-fi reduces resource barriers for fine-tuning large models and enables fine-tuning in settings with prohibitive communication cost.
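
The final averaging step can be sketched in a few lines of PyTorch (a generic weight-averaging sketch assuming all replicas share the same architecture; not the authors' code):

    import torch

    def average_state_dicts(state_dicts):
        # Average per-parameter tensors from independently fine-tuned replicas.
        # Non-floating-point entries (e.g., BatchNorm step counters) are copied
        # from the first replica rather than averaged.
        avg = {}
        for key, ref in state_dicts[0].items():
            if torch.is_floating_point(ref):
                avg[key] = torch.stack([sd[key] for sd in state_dicts]).mean(dim=0)
            else:
                avg[key] = ref.clone()
        return avg

    # usage sketch (replica state dicts sd0..sd3 are hypothetical):
    # model.load_state_dict(average_state_dicts([sd0, sd1, sd2, sd3]))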

URL: https://openreview.net/forum?id=1U0aPkBVz0

---

Title: Quantum Policy Iteration via Amplitude Estimation and Grover Search – Towards Quantum Advantage for Reinforcement Learning

Abstract: We present a full implementation and simulation of a novel quantum reinforcement learning (RL) method and mathematically prove a quantum advantage. Our approach shows in detail how to combine amplitude estimation and Grover search into a policy evaluation and improvement scheme. We first develop quantum policy evaluation (QPE) which is quadratically more efficient compared to an analogous classical Monte Carlo estimation and is based on a quantum mechanical realization of a finite Markov decision process (MDP). Building on QPE, we derive a quantum policy iteration that repeatedly improves an initial policy using Grover search until the optimum is reached. Finally, we present an implementation of our algorithm for a two-armed bandit MDP which we then simulate. Our work is a detailed and formal proof of concept for how quantum algorithms can be used to solve RL problems and shows that they can indeed yield provable speedups.

URL: https://openreview.net/forum?id=HG11PAmwQ6

---

Title: Modelling sequential branching dynamics with a multivariate branching Gaussian process

Abstract: The Branching Gaussian Process (BGP) model is a modification of the Overlapping Mixture of Gaussian Processes (OMGP) where latent functions branch in time. The BGP model was introduced as a method to model bifurcations in single-cell gene expression data and order genes by inferring their branching time parameter. A limitation of the current BGP model is that the assignment of observations to latent functions is inferred independently for each output dimension (gene). This leads to inconsistent assignments across outputs and reduces the accuracy of branching time inference. Here, we propose a multivariate branching Gaussian process (MBGP) model to perform joint branch assignment inference across multiple output dimensions. This ensures that branch assignments are consistent and leverages more data for branching time inference. Model inference is more challenging than for the original BGP or OMGP models because assignment labels can switch from trunk to branch lineages as branching times change during inference. To scale up inference to large datasets we use sparse variational Bayesian inference. We examine the effectiveness of our approach on synthetic data and a single-cell RNA-Seq dataset from mouse haematopoietic stem cells (HSCs). Our approach ensures assignment consistency by design and achieves improved accuracy in branching time inference and assignment accuracy.

URL: https://openreview.net/forum?id=9KoBOlstTq

---

Title: A Free Lunch with Influence Functions? An Empirical Evaluation of Influence Functions for Average Treatment Effect Estimation

Abstract: The applications of causal inference may be life-critical, including the evaluation of vaccinations, medicine, and social policy. However, when undertaking estimation for causal inference, practitioners rarely have access to what might be called `ground-truth' in a supervised learning setting, meaning the chosen estimation methods cannot be evaluated and must be assumed to be reliable. It is therefore crucial that we have a good understanding of the performance consistency of typical methods available to practitioners. In this work we provide a comprehensive evaluation of recent semiparametric methods (including neural network approaches) for average treatment effect estimation. Such methods have been proposed as a means to derive unbiased causal effect estimates and statistically valid confidence intervals, even when using otherwise non-parametric, data-adaptive machine learning techniques. We also propose a new estimator `MultiNet', and a variation on the semiparametric update step `MultiStep', which we evaluate alongside existing approaches. The performance of both semiparametric and `regular' methods are found to be dataset dependent, indicating an interaction between the methods used, the sample size, and nature of the data generating process. Our experiments highlight the need for practitioners to check the consistency of their findings, potentially by undertaking multiple analyses with different combinations of estimators.
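
For orientation, the semiparametric update step these estimators build on is typically the textbook one-step/AIPW correction based on the efficient influence function; a compact NumPy sketch of the standard estimator (generic, not MultiNet/MultiStep):

    import numpy as np

    def aipw_ate(y, a, mu1, mu0, e):
        # One-step (AIPW) estimate of the average treatment effect.
        # y: outcomes, a: binary treatment, mu1/mu0: outcome-model predictions,
        # e: estimated propensity scores P(A = 1 | X).
        psi = (mu1 - mu0
               + a * (y - mu1) / e
               - (1 - a) * (y - mu0) / (1 - e))
        ate = psi.mean()
        se = psi.std(ddof=1) / np.sqrt(len(psi))   # influence-function-based standard error
        return ate, se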

URL: https://openreview.net/forum?id=dQxBRqCjLr

---

Title: Cherry Hypothesis: Identifying the Cherry on the Cake for Dynamic Networks

Abstract: Dynamic networks, e.g., Dynamic Convolution (DY-Conv) and the Mixture of Experts (MoE), have been extensively explored as they can considerably improve the model's representation power with acceptable computational cost. The common practice in implementing dynamic networks is to convert given static layers into fully dynamic ones where all parameters are dynamic (at least within a single layer) and vary with the input. Recent studies empirically show the trend that more dynamic layers contribute to ever-increasing performance. However, such a fully dynamic setting 1) may cause redundant parameters and high deployment costs, limiting the applicability of dynamic networks to a broader range of tasks and models, and more importantly, 2) contradicts the previous discovery in the human brain that \textit{when human brains process an attention-demanding task, only some neurons in the task-specific areas are activated by the input, while the rest of the neurons remain in a baseline state.} Critically, there has been no effort to understand and resolve the above contradictory finding, leaving the fundamental question -- whether to make the computational parameters fully dynamic or not? -- unanswered. The main contributions of our work are to challenge the basic common sense about dynamic networks, and to propose and validate the \textsc{cherry hypothesis} -- \textit{A fully dynamic network contains a subset of dynamic parameters that, when the other dynamic parameters are transformed into static ones, can maintain or even exceed the performance of the original network.} Technically, we propose a brain-inspired partially dynamic network, namely PAD-Net, to transform the redundant dynamic parameters into static ones. Also, we further design Iterative Mode Partition to partition the dynamic and static subnets, which alleviates the redundancy in traditional fully dynamic networks. Our hypothesis and method are comprehensively supported by large-scale experiments with two typical advanced dynamic methods, i.e., DY-Conv and MoE, on both image classification and GLUE benchmarks. Encouragingly, we surpass the fully dynamic networks by $+0.7\%$ top-1 acc with only $30\%$ dynamic parameters for ResNet-50 and $+1.9\%$ average score in language understanding tasks with only $50\%$ dynamic parameters for BERT-base. As for reproducibility, the code has been uploaded to OpenReview and will be released upon acceptance.

URL: https://openreview.net/forum?id=70E72aMWVO

---

Title: Optimizing Model-Agnostic Random Subspace Ensembles

Abstract: This paper presents a model-agnostic ensemble approach for supervised learning. The proposed approach is based on a parametric version of Random Subspace, in which each base model is learned from a feature subset sampled according to a Bernoulli distribution. Parameter optimization is performed using gradient descent and is rendered tractable by using an importance sampling approach that circumvents frequent re-training of the base models after each gradient descent step. While the degree of randomization is controlled by a hyper-parameter in standard Random Subspace, it has the advantage to be automatically tuned in our parametric version. Furthermore, model-agnostic feature importance scores can be easily derived from the trained ensemble model. The optimization algorithm can also easily incorporate any differentiable regularization term to impose constraints on these importance scores.
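
A plain (non-parametrized) Random Subspace ensemble with Bernoulli feature sampling looks like the sketch below; in the paper the per-feature inclusion probabilities are the quantities optimized by gradient descent, whereas here they are simply given (illustrative only, not the proposed algorithm):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def fit_random_subspace_ensemble(X, y, probs, n_models=50, rng=None):
        # probs: per-feature Bernoulli inclusion probabilities.
        rng = rng or np.random.default_rng()
        ensemble = []
        for _ in range(n_models):
            mask = rng.random(X.shape[1]) < probs
            if not mask.any():                       # ensure at least one feature
                mask[rng.integers(X.shape[1])] = True
            model = DecisionTreeRegressor().fit(X[:, mask], y)
            ensemble.append((mask, model))
        return ensemble

    def predict_ensemble(ensemble, X):
        return np.mean([m.predict(X[:, mask]) for mask, m in ensemble], axis=0)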

URL: https://openreview.net/forum?id=zhrvDUhjNn

---

Title: LINDA: Unsupervised Learning to Interpolate in Natural Language Processing

Abstract: Despite the success of mixup in data augmentation, its applicability to natural language processing (NLP) tasks has been limited due to the discrete and variable-length nature of natural languages.
Recent studies have thus relied on domain-specific heuristics and manually crafted resources, such as dictionaries, in order to apply mixup in NLP.
In this paper, we instead propose an unsupervised learning approach to text interpolation for the purpose of data augmentation, referred to as `Learning to INterpolate for Data Augmentation' (LINDA), which does not require any heuristics or manually crafted resources but learns to interpolate between any pair of natural language sentences over a natural language manifold.
After empirically demonstrating LINDA's interpolation capability, we show that LINDA indeed allows us to seamlessly apply mixup in NLP and leads to better generalization in text classification both in-domain and out-of-domain.

URL: https://openreview.net/forum?id=KwrWgXrbt4

---

Title: Bidirectional View based Consistency Regularization for Semi-Supervised Domain Adaptation

Abstract: Distinguished from unsupervised domain adaptation (UDA), semi-supervised domain adaptation (SSDA) can additionally access a few labeled target samples during learning. Although achieving remarkable progress, target supervised information is easily overwhelmed by massive source supervised information, as there are many more labeled source samples than labeled target samples. In this work, we propose a novel method, BVCR, that better utilizes the supervised information via three schemes, i.e., modeling, exploration, and interaction. In the modeling scheme, BVCR models the source supervision and target supervision separately to avoid the target supervised information being overwhelmed by the source supervised information and to better utilize the target supervision. Besides, as both sources of supervision naturally offer distinct views of the target domain, the exploration scheme performs intra-domain consistency regularization to better explore target information with bidirectional views. Moreover, as both views are complementary to each other, the interaction scheme introduces inter-domain consistency regularization to activate information interaction bidirectionally. Thus, the proposed method is elegantly symmetrical by design and easy to implement. Extensive experiments are conducted, and the results show the effectiveness of the proposed method.

URL: https://openreview.net/forum?id=WVwnccBJLz

---

Title: Transport Score Climbing: Variational Inference Using Forward KL and Adaptive Neural Transport

Abstract: Variational inference often minimizes the ``reverse'' Kullback-Leibler (KL) $D_{KL}(q||p)$ from the approximate distribution $q$ to the posterior $p$. Recent work studies the ``forward'' KL $D_{KL}(p||q)$, which unlike reverse KL does not lead to variational approximations that underestimate uncertainty. Markov chain Monte Carlo (MCMC) methods were used to evaluate the expectation in computing the forward KL. This paper introduces Transport Score Climbing (TSC), a method that optimizes $D_{KL}(p||q)$ by using Hamiltonian Monte Carlo (HMC) but running the HMC chain on a transformed, or warped, space. A function called the transport map performs the transformation by acting as a change-of-variable from the latent variable space. TSC uses HMC samples to dynamically train the transport map while optimizing $D_{KL}(p||q)$. TSC leverages synergies, where better transport maps lead to better HMC sampling, which then leads to better transport maps. We demonstrate TSC on synthetic and real data, including using TSC to train variational auto-encoders. We find that TSC achieves competitive performance on the experiments.
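
The forward KL has the convenient property that its gradient, -E_p[grad_phi log q_phi(z)], can be estimated from (approximate) posterior samples; a minimal PyTorch sketch of one score-climbing step (generic, not TSC's transport-map machinery):

    import torch

    def score_climbing_step(q_log_prob, z_samples, optimizer):
        # One stochastic step reducing the forward KL D_KL(p || q):
        # maximize E_p[log q(z)] over samples z_samples drawn (approximately)
        # from the posterior, e.g. by an HMC chain.
        optimizer.zero_grad()
        loss = -q_log_prob(z_samples).mean()
        loss.backward()
        optimizer.step()
        return loss.item()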

URL: https://openreview.net/forum?id=zfBW39xZ2E

---

Title: TESH-GCN: Text Enriched Sparse Hyperbolic Graph Convolutional Networks

Abstract: Heterogeneous networks, which connect informative nodes containing text with different edge types, are routinely used to store and process information in various real-world applications. Graph Neural Networks (GNNs) and their hyperbolic variants provide a promising approach to encode such networks in a low-dimensional latent space through neighborhood aggregation and hierarchical feature extraction, respectively. However, these approaches typically ignore metapath structures and the available semantic information. Furthermore, these approaches are sensitive to the noise present in the training data. To tackle these limitations, in this paper, we propose Text Enriched Sparse Hyperbolic Graph Convolution Network (TESH-GCN) to capture the graph’s metapath structures using semantic signals and further improve prediction in large heterogeneous graphs. In TESH-GCN, we extract semantic node information, which successively acts as a connection signal to extract relevant nodes’ local neighborhood and graph-level metapath features from the sparse adjacency tensor in a reformulated hyperbolic graph convolution layer. These extracted features in conjunction with semantic features from the language model (for robustness) are used for the final downstream task. Experiments on various heterogeneous graph datasets show that our model outperforms the current state-of-the-art approaches by a large margin on the task of link prediction. We also report a reduction in both the training time and model parameters compared to the existing hyperbolic approaches through a reformulated hyperbolic graph convolution. Furthermore, we illustrate the robustness of our model by experimenting with different levels of simulated noise in both the graph structure and text, and also, present a mechanism to explain TESH-GCN’s prediction by analyzing the extracted metapaths.

URL: https://openreview.net/forum?id=gYo8y67uNe

---

Title: DisCo: Improving Compositional Generalization in Visual Reasoning through DIStribution COverage

Abstract: We present DisCo, a learning paradigm for improving compositional generalization of visual reasoning models by leveraging unlabeled, out-of-distribution images. DisCo has two components. The first is an iterative pseudo-labeling framework with an entropy measure, which effectively labels images of novel attribute compositions paired with randomly sampled questions. The second is a distribution coverage metric, serving as a model selection strategy that approximates generalization capability to out-of-distribution test examples, without the use of labeled data from the test distribution. Both components are built on strong empirical evidence of the correlation between the chosen metric and model generalization, and improve distribution coverage on unlabeled images. We apply DisCo to visual question answering, with three backbone networks (FiLM, TbD-net, and the Neuro-Symbolic Concept Learner), and demonstrate that it consistently enhances performance on a variety of compositional generalization tasks with varying levels of train data bias.


URL: https://openreview.net/forum?id=EgHnKOLaKW

---

Title: Probing Predictions on OOD Images via Nearest Categories

Abstract: We study out-of-distribution (OOD) prediction behavior of neural networks when they classify images from unseen classes or corrupted images. To probe the OOD behavior, we introduce a new measure, nearest category generalization (NCG), where we compute the fraction of OOD inputs that are classified with the same label as their nearest neighbor in the training set. Our motivation stems from understanding the prediction patterns of adversarially robust networks, since previous work has identified unexpected consequences of training to be robust to norm-bounded perturbations. We find that robust networks have consistently higher NCG accuracy than natural training, even when the OOD data is much farther away than the robustness radius. This implies that the local regularization of robust training has a significant impact on the network’s decision regions. We replicate our findings using many datasets, comparing new and existing training methods. Overall, adversarially robust networks resemble a nearest neighbor classifier when it comes to OOD data.
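
Computing NCG as defined above amounts to a nearest-neighbour lookup plus a prediction comparison; a scikit-learn sketch (this assumes fixed feature vectors and Euclidean distance, which may differ from the paper's setup):

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def ncg_accuracy(model_predict, X_train, y_train, X_ood):
        # Fraction of OOD inputs whose predicted label matches the label of
        # their nearest training example (nearest category generalization).
        nn = NearestNeighbors(n_neighbors=1).fit(X_train)
        _, idx = nn.kneighbors(X_ood)
        nearest_labels = y_train[idx[:, 0]]
        preds = model_predict(X_ood)
        return np.mean(preds == nearest_labels)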

URL: https://openreview.net/forum?id=fTNorIvVXG

---

Title: Learning Identity-Preserving Transformations on Data Manifolds

Abstract: Many machine learning techniques incorporate identity-preserving transformations into their models to generalize their performance to previously unseen data. These transformations are typically selected from a set of functions that are known to maintain the identity of an input when applied (e.g., rotation, translation, flipping, and scaling). However, there are many natural variations that cannot be labeled for supervision or defined through examination of the data. As suggested by the manifold hypothesis, many of these natural variations live on or near a low-dimensional, nonlinear manifold. Several techniques represent manifold variations through a set of learned Lie group operators that define directions of motion on the manifold. However, these approaches are limited because they require transformation labels when training their models and they lack a method for determining which regions of the manifold are appropriate for applying each specific operator. We address these limitations by introducing a learning strategy that does not require transformation labels and developing a method that learns the local regions where each operator is likely to be used while preserving the identity of inputs. Experiments on MNIST and Fashion MNIST highlight our model's ability to learn identity-preserving transformations on multi-class datasets. Additionally, we train on CelebA to showcase our model's ability to learn semantically meaningful transformations on complex datasets in an unsupervised manner.
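
The Lie-operator machinery referenced above typically acts by exponentiating a linear combination of generators; a tiny SciPy sketch with placeholder operators and coefficients (illustrative only, not the paper's learned model):

    import numpy as np
    from scipy.linalg import expm

    def apply_lie_transform(z, operators, coeffs):
        # Transport a latent point z by the group element exp(sum_i c_i A_i).
        # operators: list of (d, d) generator matrices; coeffs: list of scalars.
        generator = sum(c * A for c, A in zip(coeffs, operators))
        return expm(generator) @ z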

URL: https://openreview.net/forum?id=gyhiZYrk5y

---

Title: VRNN’s got a GAN: Generating Time Series using Variational Recurrent Neural Models with Adversarial Training

Abstract: Time-series data generation is a machine learning task growing in popularity, and has been a focus of deep generative methods. The task is especially important in fields where large amounts of training data are not available, and in applications where privacy preservation using synthetic data is preferred. In the past, generative adversarial models (GANs) were combined with recurrent neural networks (RNNs) to produce realistic time-series data. Moreover, RNNs with time-step variational autoencoders were shown to have the ability to produce diverse temporal realizations. In this paper, we propose a novel data generating model, dubbed VRNN-GAN, that employs an adversarial framework with an RNN-based Variational Autoencoder (VAE) serving as the generator and a bidirectional RNN serving as the discriminator. The recurrent VAE captures temporal dynamics into a learned time-varying latent space while the adversarial training encourages the generation of realistic time-series data. We compared the performance of VRNN-GAN to state-of-the-art deep generative methods on the task of generating synthetic time-series data. We show that VRNN-GAN achieves the best predictive score across all methods and yields competitive results in other well-established performance measures compared to the state-of-the-art.

URL: https://openreview.net/forum?id=JjNNIyKtiM

---

Title: StructCoder: Structure-Aware Transformer for Code Generation

Abstract: There has been a recent surge of interest in automating software engineering tasks using deep learning. This paper addresses the problem of code generation where the goal is to generate target code given source code in a different language or a natural language description. Most of the state-of-the-art deep learning models for code generation use training strategies primarily designed for natural language. However, understanding and generating code requires a more rigorous comprehension of the code syntax and semantics. With this motivation, we develop an encoder-decoder Transformer model where both the encoder and decoder are explicitly trained to recognize the syntax and data flow in the source and target codes, respectively. We not only make the encoder structure-aware by leveraging the source code's syntax tree and data flow graph, but we also support the decoder in preserving the syntax and data flow of the target code by introducing two novel auxiliary tasks: AST (Abstract Syntax Tree) paths prediction and data flow prediction. To the best of our knowledge, this is the first work to introduce a structure-aware Transformer decoder that models both syntax and data flow to enhance the quality of generated code. The proposed StructCoder model achieves state-of-the-art performance on code translation and text-to-code generation tasks in the CodeXGLUE benchmark, and improves over baselines of similar size on the APPS code generation benchmark.

URL: https://openreview.net/forum?id=ImTbXt1lWG

---

Title: To ArXiv or not to ArXiv: A Study Quantifying Pros and Cons of Posting Preprints Online

Abstract: Double-blind conferences have engaged in debates over whether to allow authors to post their papers online on arXiv or elsewhere during the review process. Independently, some authors of research papers face the dilemma of whether to put their papers on arXiv due to its pros and cons. We conduct a study to substantiate this debate and dilemma via quantitative measurements. Specifically, we conducted surveys of reviewers in two top-tier double-blind computer science conferences -- ICML 2021 (5361 submissions and 4699 reviewers) and EC 2021 (498 submissions and 190 reviewers). Our two main findings are as follows. First, more than a third of the reviewers self-report searching online for a paper they are assigned to review. Second, outside the review process, we find that preprints from better-ranked affiliations see a weakly higher visibility, with a correlation of 0.06 in ICML and 0.05 in EC. In particular, papers associated with the top-10-ranked affiliations had a visibility of approximately 11% in ICML and 22% in EC, whereas the remaining papers had a visibility of 7% and 18%, respectively.

URL: https://openreview.net/forum?id=ywiegsPRSF

---

Title: Escaping the Big Data Paradigm with Compact Transformers

Abstract: With the rise of Transformers as the standard for language processing, and their advancements in computer vision, there has been a corresponding growth in parameter size and amounts of training data. Many have come to believe that because of this, transformers are not suitable for small sets of data.
This trend leads to concerns such as the limited availability of data in certain scientific domains and the exclusion of those with limited resources from research in the field.
In this paper, we aim to present an approach for small-scale learning by introducing Compact Transformers.
We show for the first time that with the right size and convolutional tokenization, transformers can avoid overfitting and outperform state-of-the-art CNNs on small datasets.
Our models are flexible in terms of model size, and can have as few as 0.28M parameters while achieving competitive results.
Our best model can reach 98% accuracy when training from scratch on CIFAR-10 with only 3.7M parameters, which is a significant improvement in data efficiency over previous Transformer-based models: it is over 10x smaller than other transformers and 15% the size of ResNet50 while achieving similar performance.
CCT also outperforms many modern CNN based approaches, and even some recent NAS-based approaches.
Additionally, we obtain a new SOTA result on Flowers-102 with 99.76% top-1 accuracy, and improve upon the existing baseline on ImageNet (82.71% accuracy with 29% as many parameters as ViT), as well as NLP tasks.
Our simple and compact design for transformers makes them more feasible to study for those with limited computing resources and/or dealing with small datasets, while extending existing research efforts in data efficient transformers.
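
The convolutional tokenization idea can be sketched in a few lines of PyTorch (a generic conv + pool + flatten-to-tokens sketch with hypothetical sizes; not the exact CCT configuration):

    import torch
    import torch.nn as nn

    class ConvTokenizer(nn.Module):
        # Convolution + pooling, then flatten the spatial grid into a token
        # sequence for a transformer encoder.
        def __init__(self, in_ch=3, embed_dim=128):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(in_ch, embed_dim, kernel_size=3, stride=1, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
            )

        def forward(self, x):                           # x: (B, C, H, W)
            feats = self.conv(x)                        # (B, D, H', W')
            return feats.flatten(2).transpose(1, 2)     # (B, H'*W', D) token sequence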

URL: https://openreview.net/forum?id=kLZsLlIpDU

---
