Weekly TMLR digest for Feb 12, 2023

TMLR

Feb 11, 2023, 7:00:13 PM
to tmlr-annou...@googlegroups.com

Accepted papers
===============


Title: Beyond Intuition: Rethinking Token Attributions inside Transformers

Authors: Jiamin Chen, Xuhong Li, Lei Yu, Dejing Dou, Haoyi Xiong

Abstract: The multi-head attention mechanism, or rather the Transformer-based models, have long been under the spotlight, not only in text processing but also in computer vision. Several works have recently been proposed that explore token attributions along the intrinsic decision process. However, ambiguity in the formulation can lead to an accumulation of error, which makes the interpretation less trustworthy and less applicable to different variants. In this work, we propose a novel method to approximate token contributions inside Transformers. We start from the partial derivative with respect to each token, divide the interpretation process into attention perception and reasoning feedback with the chain rule, and explore each part individually with explicit mathematical derivations. In attention perception, we propose head-wise and token-wise approximations in order to learn how the tokens interact to form the pooled vector. As for reasoning feedback, we adopt a noise-decreasing strategy by applying integrated gradients to the last attention map. Our method is further validated qualitatively and quantitatively through faithfulness evaluations across different settings: single modality (BERT and ViT) and bi-modality (CLIP), different model sizes (ViT-L) and different pooling strategies (ViT-MAE), to demonstrate the broad applicability and clear improvements over existing methods.
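
The reasoning-feedback step applies integrated gradients to the last attention map; below is a minimal sketch of that generic operation (not the paper's full head-wise/token-wise decomposition), assuming a zero baseline, a straight-line path, and a callable score_fn standing in for the rest of the model:

    import torch

    def integrated_gradients_attention(score_fn, attn, steps=20):
        # Integrated gradients of a scalar score with respect to an attention map,
        # using a zero baseline and a straight-line path.
        # score_fn: maps an attention map to a scalar model score.
        # attn:     last-layer attention map, shape (heads, tokens, tokens).
        baseline = torch.zeros_like(attn)
        grads = []
        for alpha in torch.linspace(0.0, 1.0, steps):
            point = (baseline + alpha * (attn - baseline)).detach().requires_grad_(True)
            grads.append(torch.autograd.grad(score_fn(point), point)[0])
        avg_grad = torch.stack(grads).mean(dim=0)
        return (attn - baseline) * avg_grad  # attribution per attention entry

    # toy usage with a stand-in score that simply pools the attention map
    attn = torch.rand(8, 16, 16)
    attribution = integrated_gradients_attention(lambda a: a.mean(), attn)
    print(attribution.shape)  # torch.Size([8, 16, 16])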

URL: https://openreview.net/forum?id=rm0zIzlhcX

---

Title: Understanding and Simplifying Architecture Search in Spatio-Temporal Graph Neural Networks

Authors: Zhen Xu, Quanming Yao, Yong Li, Qiang Yang

Abstract: Compiling together spatial and temporal modules via a unified framework, Spatio-Temporal Graph Neural Networks (STGNNs) have been widely used for multivariate spatio-temporal forecasting tasks, e.g. traffic prediction. After numerous propositions of manually designed architectures, researchers have shown interest in Neural Architecture Search (NAS) for STGNNs. Existing methods suffer from two issues: (1) hyperparameters such as the learning rate and channel size cannot be integrated into the NAS framework, which makes model evaluation less accurate and potentially misleads the architecture search; (2) the current search space, which largely mimics DARTS-like methods, is too large for the search algorithm to find a sufficiently good candidate. In this work, we deal with both issues at the same time. We first re-examine the importance and transferability of the training hyperparameters to ensure a fair and fast comparison. Next, we set up a framework that disentangles architecture design into three disjoint angles according to how spatio-temporal representations flow and transform in architectures, which allows us to understand the behavior of architectures from a distributional perspective. This way, we can obtain good guidelines to reduce the STGNN search space and find state-of-the-art architectures by simple random search. As an illustrative example, we combine these principles with random search, which already significantly outperforms both state-of-the-art hand-designed models and recently automatically searched ones.
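
The take-away, that a reduced and well-understood search space plus plain random search already goes a long way, can be illustrated with a generic sketch; the candidate operations below are placeholders, not the search space from the paper:

    import random

    # A hedged sketch of architecture random search over a small, disentangled space.
    SPACE = {
        "spatial_op": ["gcn", "gat", "cheb"],
        "temporal_op": ["gru", "tcn", "attention"],
        "n_blocks": [2, 3, 4],
        "hidden_dim": [32, 64, 128],
    }

    def random_search(evaluate, n_trials=20, seed=0):
        # Sample architectures uniformly from SPACE and keep the best one under `evaluate`.
        rng = random.Random(seed)
        best, best_score = None, float("-inf")
        for _ in range(n_trials):
            arch = {k: rng.choice(v) for k, v in SPACE.items()}
            score = evaluate(arch)   # e.g. validation accuracy with fixed, tuned hyperparameters
            if score > best_score:
                best, best_score = arch, score
        return best, best_score

    # toy evaluator so the sketch runs end to end
    print(random_search(lambda a: a["hidden_dim"] / 128 + a["n_blocks"] / 4))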

URL: https://openreview.net/forum?id=4jEuiMPKSF

---

Title: Finite-Time Analysis of Decentralized Single-Timescale Actor-Critic

Authors: Qijun Luo, Xiao Li

Abstract: Decentralized Actor-Critic (AC) algorithms have been widely utilized for multi-agent reinforcement learning (MARL) and have achieved remarkable success. Apart from this empirical success, the theoretical convergence properties of decentralized AC algorithms are largely unexplored. Most existing finite-time convergence results are derived based on either a double-loop update or a two-timescale step size rule, and this is the case even for the centralized AC algorithm in the single-agent setting. In practice, the *single-timescale* update is widely utilized, where actor and critic are updated in an alternating manner with step sizes of the same order. In this work, we study a decentralized *single-timescale* AC algorithm. Theoretically, using linear approximation for value and reward estimation, we show that the algorithm has sample complexity of $\tilde{\mathcal{O}}(\varepsilon^{-2})$ under Markovian sampling, which matches the optimal complexity of a double-loop implementation (here, $\tilde{\mathcal{O}}$ hides a logarithmic term). When we reduce to the single-agent setting, our result yields a new sample complexity for centralized AC using a single-timescale update scheme. Central to establishing our complexity results is *the hidden smoothness of the optimal critic variable* that we reveal. We also provide a local action privacy-preserving version of our algorithm and its analysis. Finally, we conduct experiments to show the superiority of our algorithm over existing decentralized AC algorithms.
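
As a rough illustration of the single-timescale pattern (actor and critic updated alternately with step sizes of the same order), here is a generic TD actor-critic step with a linear critic; it is only a sketch of the update scheme, not the decentralized algorithm analyzed in the paper:

    import numpy as np

    def single_timescale_ac_step(theta, w, grad_log_pi, phi, reward, phi_next,
                                 alpha=0.01, beta=0.01, gamma=0.99):
        # One alternating actor-critic update with same-order step sizes alpha and beta.
        # w is a linear critic over state features phi; the actor uses the TD error as its signal.
        td_error = reward + gamma * phi_next @ w - phi @ w
        w_new = w + beta * td_error * phi                    # critic (TD(0)) step
        theta_new = theta + alpha * td_error * grad_log_pi   # actor (policy-gradient) step
        return theta_new, w_new

    # toy usage with random features and gradients
    rng = np.random.default_rng(0)
    theta, w = rng.normal(size=4), rng.normal(size=3)
    theta, w = single_timescale_ac_step(theta, w, rng.normal(size=4),
                                        rng.normal(size=3), 1.0, rng.normal(size=3))
    print(theta.shape, w.shape)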

URL: https://openreview.net/forum?id=KQRv0O8iW4

---

Title: Supervised Feature Selection with Neuron Evolution in Sparse Neural Networks

Authors: Zahra Atashgahi, Xuhao Zhang, Neil Kichler, Shiwei Liu, Lu Yin, Mykola Pechenizkiy, Raymond Veldhuis, Decebal Constantin Mocanu

Abstract: Feature selection, which selects an informative subset of variables from data, not only enhances model interpretability and performance but also alleviates resource demands. Recently, there has been growing interest in feature selection using neural networks. However, existing methods usually suffer from high computational costs when applied to high-dimensional datasets. In this paper, inspired by evolution processes, we propose a novel resource-efficient supervised feature selection method using sparse neural networks, named "NeuroFS". By gradually pruning the uninformative features from the input layer of a sparse neural network trained from scratch, NeuroFS derives an informative subset of features efficiently. By performing several experiments on $11$ low- and high-dimensional real-world benchmarks of different types, we demonstrate that NeuroFS achieves the highest ranking-based score among the considered state-of-the-art supervised feature selection models. We will make the code publicly available on GitHub after acceptance of the paper.
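
A crude sketch of the overall pattern, gradually discarding input features of a network based on the strength of their outgoing weights; the magnitude criterion and the fixed schedule here are illustrative assumptions, not NeuroFS's neuron-evolution rule:

    import numpy as np

    def top_features(W_in, n_keep):
        # Rank features by the L1 norm of their outgoing first-layer weights; keep the n_keep strongest.
        importance = np.abs(W_in).sum(axis=1)
        return np.argsort(importance)[::-1][:n_keep]

    rng = np.random.default_rng(0)
    W = rng.normal(size=(100, 32))          # stand-in for the (sparse) input-layer weight matrix
    kept = np.arange(100)
    for n_keep in (80, 60, 40, 20):         # gradually shrink the input layer
        kept = kept[top_features(W[kept], n_keep)]
        # in practice the sparse network keeps training (and regrowing connections) between rounds
    print(len(kept))                        # 20 candidate features remain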

URL: https://openreview.net/forum?id=GcO6ugrLKp

---


New submissions
===============


Title: PAC-Bayes bounds for Unbounded Losses through Supermartingales

Abstract: While PAC-Bayes is now an established learning framework for bounded losses, its extension to the case of unbounded losses (as simple as the squared loss on an unbounded space) remains largely uncharted and has attracted a growing interest in recent years. We contribute to this line of work by extending to any data distribution (in particular dependent ones) the idea of \cite{kuzborskij2019efron}: PAC-Bayes provides generalisation bounds for unbounded losses with the sole assumption of bounded variance of the loss function, which is less restrictive for a number of contemporary learning problems. Our technical innovation consists in exploiting an extension of Markov's inequality for supermartingales. Our proof technique unifies and extends different PAC-Bayesian frameworks by providing bounds for unbounded martingales as well as bounds for batch and online learning with unbounded losses.
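
The supermartingale extension of Markov's inequality referred to here is usually stated as Ville's inequality; as a reminder of the standard form only (not the paper's bound itself):

    % Ville's inequality: for a non-negative supermartingale (M_t)_{t >= 0} and any a > 0,
    \Pr\Big(\sup_{t \ge 0} M_t \ge a\Big) \;\le\; \frac{\mathbb{E}[M_0]}{a}.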

URL: https://openreview.net/forum?id=qxrwt6F3sf

---

Title: Multi-label Node Classification On Graph-Structured Data

Abstract: Graph Neural Networks (GNNs) have shown state-of-the-art improvements in node classification tasks on graphs. While these improvements have been largely demonstrated in a multi-class classification scenario, a more general and realistic scenario in which each node could have multiple labels has so far received little attention. The first challenge in conducting focused studies on multi-label node classification is the limited number of publicly available multi-label graph datasets. Therefore, as our first contribution, we collect and release three real-world biological datasets and develop a multi-label graph generator to generate datasets with tunable properties. While high label similarity (high homophily) is usually attributed to the success of GNNs, we argue that a multi-label scenario does not follow the usual semantics of homophily and heterophily so far defined for a multi-class scenario. As our second contribution, besides defining homophily for the multi-label scenario, we develop a new approach that dynamically fuses the feature and label correlation information to learn label-informed representations. Finally, we perform a large-scale comparative study with $10$ methods and $9$ datasets which also showcase the effectiveness of our approach. We release our benchmark at \url{https://anonymous.4open.science/r/LFLF-5D8C/}.
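
To make the multi-label homophily discussion concrete, one natural edge-level measure is the mean Jaccard overlap of the endpoints' label sets; this is an illustrative assumption, not necessarily the definition proposed in the paper:

    import numpy as np

    def multilabel_edge_homophily(edges, labels):
        # Mean Jaccard overlap of label sets across edges.
        # edges:  list of (u, v) pairs.
        # labels: dict mapping node -> set of labels.
        overlaps = []
        for u, v in edges:
            union = labels[u] | labels[v]
            inter = labels[u] & labels[v]
            overlaps.append(len(inter) / len(union) if union else 0.0)
        return float(np.mean(overlaps))

    # toy usage
    labels = {0: {"a", "b"}, 1: {"b"}, 2: {"c"}}
    print(multilabel_edge_homophily([(0, 1), (1, 2)], labels))  # 0.25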

URL: https://openreview.net/forum?id=EZhkV2BjDP

---

Title: Inherent Limits on Topology-Based Link Prediction

Abstract: Link prediction systems (e.g. recommender systems) typically use graph topology as one of their main sources of information. However, automorphisms and related properties of graphs beget inherent limits in predictability. We calculate hard upper bounds on how well graph topology alone enables link prediction for a wide variety of real-world graphs. We find that in the sparsest of these graphs the upper bounds are surprisingly low, thereby demonstrating that prediction systems on sparse graph data are inherently limited and require information in addition to the graph topology.

URL: https://openreview.net/forum?id=izL3B8dPx1

---

Title: Patch-Wise Random and Noisy CutMix for Privacy-Preserving Split Learning with Vision Transformer

Abstract: In computer vision, the vision transformer (ViT) has increasingly superseded the convolutional neural network (CNN) for improved accuracy and robustness. Since ViT often comes with large model sizes and high sample complexity, split learning (SL) is a promising solution for training ViT using the large memory and computing resources at a server together with the sheer amount of private data owned by users or clients. In SL, a ViT is split into two parts under a server-client architecture. The server stores the upper segment, which is associated with multiple clients, each of which stores the lower segment. At the cut layer between the upper and lower segments, SL exchanges the cut-layer hidden activations in the forward propagation (FP), referred to as smashed data, and the cut-layer gradients in the backpropagation (BP), which are exposed to various attacks on private training data. To mitigate the risk of data breaches in classification tasks, inspired by the CutMix regularization, we propose a novel privacy-preserving SL framework that injects Gaussian noise into smashed data and mixes randomly chosen patches of smashed data across clients, coined DP-CutMixSL. By analysis, we prove that DP-CutMixSL is a differentially private (DP) mechanism that amplifies the privacy budget with respect to membership inference attacks in FP. By simulation, we additionally show that DP-CutMixSL protects privacy from reconstruction attacks in FP and from label inference attacks in BP. Surprisingly, DP-CutMixSL even improves accuracy and robustness to imbalanced data distributions over clients, due to the regularization effect of its patch-wise random CutMix operations.
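
A minimal sketch of the forward-pass mechanism described, mixing randomly chosen patches of cut-layer activations across two clients and adding Gaussian noise; the mixing fraction and noise scale are illustrative, not the calibrated DP parameters from the paper:

    import torch

    def noisy_patch_cutmix(smashed_a, smashed_b, mix_fraction=0.5, sigma=0.1):
        # Mix randomly chosen token patches of two clients' cut-layer activations,
        # then add Gaussian noise. smashed_*: (tokens, dim) ViT cut-layer activations.
        tokens = smashed_a.shape[0]
        n_swap = int(mix_fraction * tokens)
        idx = torch.randperm(tokens)[:n_swap]        # patches taken from client b
        mixed = smashed_a.clone()
        mixed[idx] = smashed_b[idx]
        return mixed + sigma * torch.randn_like(mixed)

    # toy usage
    a, b = torch.randn(16, 64), torch.randn(16, 64)
    print(noisy_patch_cutmix(a, b).shape)            # torch.Size([16, 64])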

URL: https://openreview.net/forum?id=E4mUkIJ9kn

---

Title: Robust Alzheimer’s Progression Modeling using Cross-Domain Self-Supervised Deep Learning

Abstract: Developing successful artificial intelligence systems in practice depends both on robust deep learning models as well as large high quality data. Acquiring and labeling data can become prohibitively expensive and time-consuming in many real-world applications such as clinical disease models. Self-supervised learning has demonstrated great potential in increasing model accuracy and robustness in small data regimes. In addition, many clinical imaging and disease modeling applications rely heavily on regression of continuous quantities. However, the applicability of self-supervised learning for these medical-imaging regression tasks has not been extensively studied. In this study, we develop a cross-domain self-supervised learning approach for disease prognostic modeling as a regression problem using medical images as input. We demonstrate that self-supervised pre-training can improve the prediction of Alzheimer's Disease progression from brain MRI. We also show that pre-training on extended (but not labeled) brain MRI data outperforms pre-training on natural images. We further observe that the highest performance is achieved when both natural images and extended brain-MRI data are used for pre-training.

URL: https://openreview.net/forum?id=HVAeM6sNo8

---

Title: Denise: Deep Robust Principal Component Analysis for Positive Semidefinite Matrices

Abstract: The robust PCA of covariance matrices plays an essential role when isolating key explanatory features. The currently available methods for performing such a low-rank plus sparse decomposition are matrix specific, meaning that those algorithms must be re-run for every new matrix. Since these algorithms are computationally expensive, it is preferable to learn and store a function that nearly instantaneously performs this decomposition when evaluated. Therefore, we introduce Denise, a deep learning-based algorithm for robust PCA of covariance matrices, or more generally of symmetric positive semidefinite matrices, which learns precisely such a function. Theoretical guarantees for Denise are provided. These include a novel universal approximation theorem adapted to our geometric deep learning problem and convergence to an optimal solution of the learning problem. Our experiments show that Denise matches state-of-the-art performance in terms of decomposition quality, while being approximately 2000× faster than the state-of-the-art, PCP, and 200× faster than the current speed-optimized method, fast PCP.
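
A hedged sketch of the idea of a learned low-rank plus sparse split of a PSD matrix; the tiny MLP below is a placeholder architecture, not the network or training objective used by Denise:

    import torch
    import torch.nn as nn

    class LowRankPlusSparse(nn.Module):
        # Learn a map from a PSD matrix M to a low-rank part L = G G^T and a residual S = M - L.
        def __init__(self, n, rank):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(n * n, 128), nn.ReLU(), nn.Linear(128, n * rank))
            self.n, self.rank = n, rank

        def forward(self, M):
            G = self.net(M.reshape(-1)).reshape(self.n, self.rank)
            L = G @ G.T                 # PSD and of rank at most `rank` by construction
            return L, M - L             # a sparsity penalty on M - L would be added during training

    # toy usage on a random covariance matrix
    X = torch.randn(50, 10)
    L, S = LowRankPlusSparse(n=10, rank=3)(X.T @ X / 50)
    print(L.shape, S.shape)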

URL: https://openreview.net/forum?id=D45gGvUZp2

---

Title: Conditional Permutation Invariant Flows

Abstract: We present a novel, conditional generative probabilistic model of set-valued data with a tractable log density. This model is a continuous normalizing flow governed by permutation equivariant dynamics. These dynamics are driven by a learnable per-set-element term and pairwise interactions, both parametrized by deep neural networks. We illustrate the utility of this model via applications including (1) complex traffic scene generation conditioned on visually specified map information, and (2) object bounding box generation conditioned directly on images. We train our model by maximizing the expected likelihood of labeled conditional data under our flow, with the aid of a penalty that ensures the dynamics are smooth and hence efficiently solvable. Our method significantly outperforms non-permutation invariant baselines in terms of log likelihood and domain-specific metrics (offroad, collision, and combined infractions), yielding realistic samples that are difficult to distinguish from real data.

URL: https://openreview.net/forum?id=DUsgPi3oCC

---

Title: GLACIAL: Granger and Learning-based Causality Analysis for Longitudinal Studies

Abstract: The Granger framework is widely used for discovering causal relations based on time-varying signals. Implementations of Granger causality (GC) are mostly developed for densely sampled timeseries data. A substantially different setting, particularly common in population health applications, is the longitudinal study design, where multiple individuals are followed and sparsely observed for a limited number of times. Longitudinal studies commonly track many variables, which are likely governed by nonlinear dynamics that might have individual-specific idiosyncrasies and exhibit both direct and indirect causes. Furthermore, real-world longitudinal data often suffer from widespread missingness. GC methods are not well-suited to handle these issues. In this paper, we propose an approach named GLACIAL (i.e. “Granger and LeArning-based CausalIty Analysis for Longitudinal studies”) to fill this methodological gap by marrying GC with a multi-task neural model. GLACIAL treats individuals as independent samples and uses the model’s average prediction accuracy on hold-out individuals to probe causal links. Input feature dropout and model interpolation are used to efficiently learn nonlinear dynamic relationships between a large number of variables and to handle missing values respectively. Additional heuristics in GLACIAL are employed to distinguish between direct and indirect causes. Extensive experiments on synthetic and real data show GLACIAL outperforming competitive baselines and confirm its utility.

URL: https://openreview.net/forum?id=kOs37EzuUE

---

Title: Stochastic Batch Acquisition: A Simple Baseline for Deep Active Learning

Abstract: We examine a simple stochastic strategy for adapting well-known single-point acquisition functions to allow batch active learning. Unlike acquiring the top-K points from the pool set, score- or rank-based sampling takes into account that acquisition scores change as new data are acquired. This simple strategy for adapting standard single-sample acquisition strategies performs just as well as compute-intensive state-of-the-art batch acquisition functions, like BatchBALD or BADGE, while using orders of magnitude less compute. In addition to providing a practical option for machine learning practitioners, the surprising success of the proposed method in a wide range of experimental settings raises a difficult question for the field: are expensive batch acquisition methods pulling their weight?
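
The core recipe, sampling a batch with probabilities derived from the single-point scores instead of taking the top-K, fits in a few lines; the softmax-with-temperature weighting below is one common variant and an assumption, not necessarily the exact rule studied in the paper:

    import numpy as np

    def stochastic_batch_acquire(scores, batch_size, temperature=1.0, rng=None):
        # Sample a batch without replacement, with probability given by a softmax of
        # single-point acquisition scores. Near-ties are broken stochastically, which
        # mimics the fact that scores would change as points are acquired.
        rng = rng or np.random.default_rng()
        z = (scores - scores.max()) / temperature
        p = np.exp(z) / np.exp(z).sum()
        return rng.choice(len(scores), size=batch_size, replace=False, p=p)

    scores = np.random.rand(1000)        # e.g. BALD or entropy scores from a single model pass
    print(stochastic_batch_acquire(scores, batch_size=32))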

URL: https://openreview.net/forum?id=vcHwQyNBjW

---

Title: A Revenue Function for Comparison-Based Hierarchical Clustering

Abstract: Comparison-based learning addresses the problem of learning when, instead of explicit features or pairwise similarities, one only has access to comparisons of the form: \emph{Object $A$ is more similar to $B$ than to $C$.} Recently, it has been shown that, in Hierarchical Clustering, single and complete linkage can be directly implemented using only such comparisons while several algorithms have been proposed to emulate the behaviour of average linkage. Hence, finding hierarchies (or dendrograms) using only comparisons is a well understood problem. However, evaluating their meaningfulness when no ground-truth nor explicit similarities are available remains an open question.

In this paper, we bridge this gap by proposing a new revenue function that allows one to measure the goodness of dendrograms using only comparisons. We show that this function is closely related to Dasgupta's cost for hierarchical clustering that uses pairwise similarities.
On the theoretical side, we use the proposed revenue function to resolve the open problem of whether one can approximately recover a latent hierarchy using few triplet comparisons. On the practical side, we present principled algorithms for comparison-based hierarchical clustering based on the maximisation of the revenue and we empirically compare them with existing methods.
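
For readers unfamiliar with the similarity-based quantity mentioned, Dasgupta's cost of a hierarchy over pairwise similarities is usually written as follows (the comparison-only revenue function itself is defined in the paper):

    % Dasgupta's cost of a hierarchy T with pairwise similarities w_{ij}:
    \mathrm{cost}(T) \;=\; \sum_{i<j} w_{ij}\,\big|\mathrm{leaves}\big(T[i \vee j]\big)\big|,
    % where T[i \vee j] is the subtree rooted at the lowest common ancestor of leaves i and j.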

URL: https://openreview.net/forum?id=QzWr4w8PXx

---

Title: Identifying latent distances with Finslerian geometry

Abstract: Riemannian geometry provides powerful tools to explore the latent space of generative models while preserving the inherent structure of the data. Distance and volume measures can be computed from a Riemannian metric defined by pulling back the Euclidean metric from the data to the latent manifold. However, most generative models are stochastic, and so is the pullback metric. Manipulating stochastic objects is at best impractical, and at worst unachievable. To perform operations such as interpolation, or to measure the distance between data points, we need a deterministic approximation of the pullback metric. In this work, we define a new metric as the expected length derived from the stochastic pullback metric. We show that this metric defines a Finsler metric. We compare it with the expected pullback metric and show that, in high dimensions, the metrics converge to each other at a rate of $\mathcal{O}\left(\frac{1}{D}\right)$.
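
As a hedged sketch of the two objects being compared, here is the usual pullback length of a latent curve and the two ways of handling a stochastic decoder; this is background notation, not the paper's formal definitions:

    % Length of a latent curve gamma under the metric pulled back through a decoder f with Jacobian J_f:
    \mathrm{len}(\gamma) \;=\; \int_0^1 \sqrt{\dot\gamma(t)^\top J_f(\gamma(t))^\top J_f(\gamma(t))\,\dot\gamma(t)}\;\mathrm{d}t .
    % With a stochastic decoder one may take the expected length (a Finsler structure, as proposed here)
    % or the length under the expected metric E[J_f^\top J_f] (Riemannian); the abstract compares the two.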

URL: https://openreview.net/forum?id=bm2XSzY6o7

---

Title: Learning Interpretable Models Using an Oracle

Abstract: We look at a specific aspect of model interpretability: models often need to be constrained in size to be considered interpretable; e.g., a decision tree of depth 5 is easier to interpret than one of depth 50. But smaller models also tend to have high bias. This suggests a trade-off between interpretability and accuracy. Our work addresses this by: (a) showing that learning a training distribution can often increase the accuracy of small models, and therefore may be used as a strategy to compensate for small sizes, and (b) providing a model-agnostic algorithm to learn such training distributions. We also present a surprising artifact: the learned training distribution may be different from the test distribution.

We pose the distribution learning problem as one of optimizing parameters for an Infinite Beta Mixture Model based on a Dirichlet Process, so that the held-out accuracy of a model trained on a sample from this distribution is maximized. To make computation tractable, we project the training data onto one dimension: prediction uncertainty scores as provided by a highly accurate oracle model. A Bayesian Optimizer is used for learning the parameters.

Empirical results using multiple real-world datasets, various oracles, and interpretable models with different notions of model size are presented. We observe significant relative improvements in the F1-score in most cases, occasionally seeing improvements greater than $100\%$ over baselines.

Additionally, we show that the proposed algorithm provides the following benefits: (a) it is a framework which allows for flexibility in implementation, (b) it can be used across feature spaces, e.g., we show that the text classification accuracy of a Decision Tree using character n-grams improves when using a Gated Recurrent Unit as an oracle, which uses a sequence of characters as its input, (c) it can be used to train models that have a non-differentiable training loss, e.g., Decision Trees, and (d) reasonable defaults exist for most parameters of the algorithm, which makes it convenient to use.

URL: https://openreview.net/forum?id=12QKTmXjZn

---

Title: Retiring $\Delta \text{DP}$: New Distribution-Level Metrics for Demographic Parity

Abstract: Demographic parity is the most widely recognized measure of group fairness in machine learning, ensuring equal treatment of different demographic groups. Numerous works aim to achieve demographic parity by pursuing the commonly used metric $\Delta DP$. Unfortunately, in this paper, we reveal that the fairness metric $\Delta DP$ cannot precisely measure the violation of demographic parity, because it inherently has the following drawbacks: \textit{i)} zero-value $\Delta DP$ does not guarantee zero violation of demographic parity, \textit{ii)} $\Delta DP$ values can vary with different classification thresholds. To this end, we propose two new fairness metrics, \textsf{A}rea \textsf{B}etween \textsf{P}robability density function \textsf{C}urves (\textsf{ABPC}) and \textsf{A}rea \textsf{B}etween \textsf{C}umulative density function \textsf{C}urves (\textsf{ABCC}), to precisely measure the violation of demographic parity at the distribution level. The new fairness metrics directly measure the difference between the distributions of the prediction probability for different demographic groups. Thus our proposed new metrics enjoy: \textit{i)} zero-value \textsf{ABCC}/\textsf{ABPC} guarantees zero violation of demographic parity; \textit{ii)} \textsf{ABCC}/\textsf{ABPC} guarantees demographic parity even as the classification threshold is adjusted. We further re-evaluate the existing fair models with our proposed fairness metrics and observe different fairness behaviors of those models under the new metrics. The code is anonymously available at \url{https://anonymous.4open.science/r/fairness_metric-36EC}.
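
A small numpy sketch of the two quantities as described, the area between the groups' score CDFs (ABCC) and score density estimates (ABPC); the grid and histogram binning are illustrative choices, not those of the paper's implementation:

    import numpy as np

    def abcc(scores_g0, scores_g1, n_grid=1001):
        # Area between the empirical CDFs of prediction scores for two demographic groups.
        grid = np.linspace(0.0, 1.0, n_grid)
        cdf0 = np.searchsorted(np.sort(scores_g0), grid, side="right") / len(scores_g0)
        cdf1 = np.searchsorted(np.sort(scores_g1), grid, side="right") / len(scores_g1)
        return np.mean(np.abs(cdf0 - cdf1)) * (grid[-1] - grid[0])

    def abpc(scores_g0, scores_g1, bins=50):
        # Area between histogram estimates of the score densities for two demographic groups.
        edges = np.linspace(0.0, 1.0, bins + 1)
        pdf0, _ = np.histogram(scores_g0, bins=edges, density=True)
        pdf1, _ = np.histogram(scores_g1, bins=edges, density=True)
        return np.sum(np.abs(pdf0 - pdf1)) * (edges[1] - edges[0])

    rng = np.random.default_rng(0)
    s0, s1 = rng.beta(2, 5, 5000), rng.beta(5, 2, 5000)   # toy score distributions per group
    print(abcc(s0, s1), abpc(s0, s1))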

URL: https://openreview.net/forum?id=LjDFIWWVVa

---

Title: Noise-robust Graph Learning by Estimating and Leveraging Pairwise Interactions

Abstract: Teaching Graph Neural Networks (GNNs) to accurately classify nodes under severely noisy labels is an important problem in real-world graph learning applications, but it is currently underexplored. Although pairwise training methods have demonstrated promise in supervised metric learning and unsupervised contrastive learning, they remain less studied on noisy graphs, where the structural pairwise interactions (PI) between nodes are abundant and thus might benefit label noise learning more than pointwise methods do. This paper bridges the gap by proposing a pairwise framework for noisy node classification on graphs, which relies on the PI as a primary learning proxy in addition to the pointwise learning from the noisy node class labels. Our proposed framework PI-GNN contributes two novel components: (1) a confidence-aware PI estimation model that adaptively estimates the PI labels, which are defined as whether the two nodes share the same node labels, and (2) a decoupled training approach that leverages the estimated PI labels to regularize a node classification model for robust node classification. Extensive experiments on different datasets and GNN architectures demonstrate the effectiveness of PI-GNN, yielding a promising improvement over the state-of-the-art methods.

URL: https://openreview.net/forum?id=r7imkFEAQb

---

Title: Fairness via In-Processing in the Over-parameterized Regime: A Cautionary Tale

Abstract: The success of deep learning is driven by the counter-intuitive ability of over-parameterized deep neural networks (DNNs) to generalize, even when they have sufficiently many parameters to perfectly fit the training data. In practice, test error often continues to decrease with increasing over-parameterization, a phenomenon referred to as double descent. This allows deep learning engineers to instantiate large models without having to worry about over-fitting. Despite its benefits, however, prior work has shown that over-parameterization can exacerbate bias against minority subgroups. Several fairness-constrained DNN training methods have been proposed to address this concern. Here, we critically examine MinDiff, a fairness-constrained training procedure implemented within TensorFlow's Responsible AI Toolkit, that aims to achieve Equality of Opportunity. We show that although MinDiff improves fairness for under-parameterized models, it is likely to be ineffective in the over-parameterized regime. This is because an overfit model with zero training loss is trivially group-wise fair on training data, creating an “illusion of fairness” and thus turning off the MinDiff optimization (this applies to any disparity-based measure that depends on errors or accuracy, but not to demographic parity). We find that within specified fairness constraints, under-parameterized MinDiff models can even have lower error compared to their over-parameterized counterparts (despite baseline over-parameterized models having lower error compared to their under-parameterized counterparts). We further show that MinDiff optimization is very sensitive to the choice of batch size in the under-parameterized regime. Thus, fair model training using MinDiff requires time-consuming hyper-parameter searches. Finally, we suggest using previously proposed regularization techniques, viz. L2, early stopping and flooding, in conjunction with MinDiff to train fair over-parameterized models. In our results, over-parameterized models trained using MinDiff+regularization with standard batch sizes are fairer than their under-parameterized counterparts, suggesting that at the very least, regularizers should be integrated into fair deep learning flows.

URL: https://openreview.net/forum?id=f4VyYhkRvi

---

Title: Limitations of and Alternatives to Benchmarking in Reinforcement Learning Research

Abstract: Novel reinforcement learning algorithms, or improvements on existing ones, are commonly justified by evaluating their performance on benchmark environments and are compared to an ever-changing set of standard algorithms. However, despite numerous calls for improvements, experimental practices continue to produce misleading or unsupported claims. One reason for the ongoing substandard practices is that conducting rigorous benchmarking experiments requires substantial computational time. This work investigates the sources of increased computation costs in rigorous experiment designs. We show that conducting rigorous performance benchmarks will likely have computational costs that are often prohibitive. As a result, we question the value of performance evaluation as a primary experimentation tool and argue for using a qualitatively different experimentation paradigm that can provide more insight from less computation. Furthermore, we strongly recommend that the community switch to the new experimentation paradigm and encourage reviewers to adopt stricter standards for experiments.

URL: https://openreview.net/forum?id=HP1V7QRXgv

---

Title: On Averaging ROC Curves

Abstract: Receiver operating characteristic (ROC) curves are a popular method of summarising the performance of classifiers. The ROC curve describes the separability of the distributions of predictions from a two-class classifier. There are a variety of situations in which an analyst seeks to aggregate multiple ROC curves into a single representative example. A number of methods of doing so are available; however, there is a degree of subtlety that is often overlooked when selecting the appropriate one. An important component of this relates to the interpretation of the decision process for which the classifier will be used. This paper summarises a number of methods of aggregation and carefully delineates the interpretations of each in order to inform their correct usage. A toy example is provided that highlights how an injudicious choice of aggregation method can lead to erroneous conclusions.
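
One of the aggregation choices discussed in this space is vertical averaging of TPR over a common FPR grid; a hedged sketch of that option follows (it is not presented as the paper's recommended method):

    import numpy as np
    from sklearn.metrics import roc_curve

    def vertical_average_roc(curves, n_grid=101):
        # Average several ROC curves by interpolating TPR on a common FPR grid.
        # curves: list of (fpr, tpr) arrays as returned by sklearn's roc_curve.
        grid = np.linspace(0.0, 1.0, n_grid)
        tprs = [np.interp(grid, fpr, tpr) for fpr, tpr in curves]
        return grid, np.mean(tprs, axis=0)

    # toy usage with two classifiers scored on the same labels
    rng = np.random.default_rng(0)
    y = rng.integers(0, 2, 500)
    curves = []
    for noise in (0.5, 1.0):
        fpr, tpr, _ = roc_curve(y, y + noise * rng.normal(size=500))
        curves.append((fpr, tpr))
    fpr_grid, mean_tpr = vertical_average_roc(curves)
    print(mean_tpr[:5])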

URL: https://openreview.net/forum?id=FByH3qL87G

---

Title: Lifelong Reinforcement Learning with Modulating Masks

Abstract: Lifelong learning aims to create AI systems that continuously and incrementally learn during a lifetime, similar to biological learning. Attempts so far have met problems, including catastrophic forgetting, interference among tasks, and the inability to exploit previous knowledge. While considerable research has focused on learning multiple input distributions, typically in classification, lifelong reinforcement learning (LRL) must also deal with variations in the state and transition distributions, and in the reward functions. Modulating masks, recently developed for classification, are particularly suitable to deal with such a large spectrum of task variations. In this paper, we adapted modulating masks to work with deep LRL, specifically PPO and IMPALA agents. The comparison with LRL baselines in both discrete and continuous RL tasks shows superior performance. We further investigated the use of a linear combination of previously learned masks to exploit previous knowledge when learning new tasks: not only is learning faster, but the algorithm also solves tasks that we could not otherwise solve from scratch due to extremely sparse rewards. The results suggest that RL with modulating masks is a promising approach to lifelong learning, to the composition of knowledge to learn increasingly complex tasks, and to knowledge reuse for efficient and faster learning.

URL: https://openreview.net/forum?id=V7tahqGrOq

---

Title: Multi-point Dimensionality Reduction to Improve Projection Layout Reliability

Abstract: In ordinary Dimensionality Reduction (DR), each data instance in a high dimensional space (original space) is mapped to one point in a low dimensional space (visual space). This builds a layout of projected points that attempts to preserve, as much as possible, some property of the data such as distances, neighbourhood relationships, and/or topology structures, but with the ultimate goal of approximating semantic properties of the data. The approximation of semantic properties is achieved by preserving geometric properties or topology structures in visual space. In this paper, the first general algorithm for Multi-point Dimensionality Reduction is introduced, in which each data instance can be mapped to possibly more than one point in visual space with the aim of improving the reliability, usability and interpretability of dimensionality reduction. Furthermore, by allowing the points in visual space to be split into two layers while maintaining the possibility of having more than one projection per data instance, the benefit of separating more reliable points from less reliable points is discussed. The proposed algorithm, named Layered Vertex Splitting Data Embedding (LVSDE), is built upon and extends a combination of ordinary DR and graph drawing techniques. Based on the experiments of this paper, LVSDE practically outperforms popular ordinary DR methods visually in terms of semantics, group separation, subgroup detection or combinational group detection.

URL: https://openreview.net/forum?id=mSDlVLfXN6

---

Title: An Analysis of Abstracted Model-Based Reinforcement Learning

Abstract: Many methods for Model-based Reinforcement Learning (MBRL) in Markov decision processes (MDPs) provide guarantees for both the accuracy of the model they can deliver and the learning efficiency. At the same time, state abstraction techniques allow for a reduction of the size of an MDP while maintaining a bounded loss with respect to the original problem. Therefore, it may come as a surprise that no such guarantees are available when combining both techniques, i.e., where MBRL merely observes abstract states. Our theoretical analysis shows that abstraction can introduce a dependence between samples collected online (e.g., in the real world). That means that, without taking this dependence into account, results for MBRL do not directly extend to this setting. Our result shows that we can use concentration inequalities for martingales to overcome this problem. This result makes it possible to extend the guarantees of existing MBRL algorithms to the setting with abstraction. We illustrate this by combining R-MAX, a prototypical MBRL algorithm, with abstraction, thus producing the first performance guarantees for ‘Abstracted RL’: model-based reinforcement learning with an abstract model.
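
The martingale concentration tool alluded to is typically of Azuma-Hoeffding type; its standard form is quoted here only as background (not the paper's specific bound):

    % Azuma--Hoeffding: for a martingale (X_t) with bounded increments |X_t - X_{t-1}| <= c_t,
    \Pr\big(|X_n - X_0| \ge \varepsilon\big) \;\le\; 2\exp\!\left(-\frac{\varepsilon^2}{2\sum_{t=1}^{n} c_t^2}\right).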

URL: https://openreview.net/forum?id=YQWOzzSMPp

---
