# Weekly TMLR digest for Aug 07, 2022

2 views

### TMLR

Aug 6, 2022, 8:00:10 PMAug 6

Accepted papers
===============

Title: Do ReLU Networks Have An Edge When Approximating Compactly-Supported Functions?

Authors: Anastasis Kratsios, Behnoosh Zamanlooy

Abstract: We study the problem of approximating compactly-supported integrable functions while implementing their support set using feedforward neural networks. Our first main result transcribes this structured'' approximation problem into a universality problem. We do this by constructing a refinement of the usual topology on the space $L^1_{\operatorname{loc}}(\mathbb{R}^d,\mathbb{R}^D)$ of locally-integrable functions in which compactly-supported functions can only be approximated in $L^1$-norm by functions with matching discretized support. We establish the universality of ReLU feedforward networks with bilinear pooling layers in this refined topology. Consequentially, we find that ReLU feedforward networks with bilinear pooling can approximate compactly supported functions while implementing their discretized support. We derive a quantitative uniform version of our universal approximation theorem on the dense subclass of compactly-supported Lipschitz functions. This quantitative result expresses the depth, width, and the number of bilinear pooling layers required to construct this ReLU network via the target function's regularity, the metric capacity and diameter of its essential support, and the dimensions of the inputs and output spaces. Conversely, we show that polynomial regressors and analytic feedforward networks are not universal in this space.

---

Title: Recurrent networks, hidden states and beliefs in partially observable environments

Authors: Gaspard Lambrechts, Adrien Bolland, Damien Ernst

Abstract: Reinforcement learning aims to learn optimal policies from interaction with environments whose dynamics are unknown. Many methods rely on the approximation of a value function to derive near-optimal policies. In partially observable environments, these functions depend on the complete sequence of observations and past actions, called the history. In this work, we show empirically that recurrent neural networks trained to approximate such value functions internally filter the posterior probability distribution of the current state given the history, called the belief. More precisely, we show that, as a recurrent neural network learns the Q-function, its hidden states become more and more correlated with the beliefs of state variables that are relevant to optimal control. This correlation is measured through their mutual information. In addition, we show that the expected return of an agent increases with the ability of its recurrent architecture to reach a high mutual information between its hidden states and the beliefs. Finally, we show that the mutual information between the hidden states and the beliefs of variables that are irrelevant for optimal control decreases through the learning process. In summary, this work shows that in its hidden states, a recurrent neural network approximating the Q-function of a partially observable environment reproduces a sufficient statistic from the history that is correlated to the relevant part of the belief for taking optimal actions.

---

Title: Self-supervise, Refine, Repeat: Improving Unsupervised Anomaly Detection

Authors: Jinsung Yoon, Kihyuk Sohn, Chun-Liang Li, Sercan O Arik, Chen-Yu Lee, Tomas Pfister

Abstract: Anomaly detection (AD), separating anomalies from normal data, has many applications across domains, from security to healthcare. While most previous works were shown to be effective for cases with fully or partially labeled data, that setting is in practice less common due to labeling being particularly tedious for this task. In this paper, we focus on fully unsupervised AD, in which the entire training dataset, containing both normal and anomalous samples, is unlabeled. To tackle this problem effectively, we propose to improve the robustness of one-class classification trained on self-supervised representations using a data refinement process. Our proposed data refinement approach is based on an ensemble of one-class classifiers (OCCs), each of which is trained on a disjoint subset of training data. Representations learned by self-supervised learning on the refined data are iteratively updated as the data refinement improves. We demonstrate our method on various unsupervised AD tasks with image and tabular data. With a 10% anomaly ratio on CIFAR-10 image data / 2.5% anomaly ratio on Thyroid tabular data, the proposed method outperforms the state-of-the-art one-class classifier by 6.3 AUC and 12.5 average precision / 22.9 F1-score.

---

Title: Exploring Generative Neural Temporal Point Process

Authors: Haitao Lin, Lirong Wu, Guojiang Zhao, Liu Pai, Stan Z. Li

Abstract: Temporal point process (TPP) is commonly used to model the asynchronous event sequence featuring occurrence timestamps and revealed by probabilistic models conditioned on historical impacts.
While lots of previous works have focused on goodness-of-fit' of TPP models by maximizing the likelihood, their predictive performance is unsatisfactory, which means the timestamps generated by models are far apart from true observations.
Recently, deep generative models such as denoising diffusion and score matching models have achieved great progress in image generating tasks by demonstrating their capability of generating samples of high quality.
However, there are no detailed and unified works exploring and studying the potential of generative models in the context of event prediction of TPP.
In this work, we try to fill the gap by designing a unified generative framework for neural temporal point process (GNTPP) model to explore their feasibility and effectiveness, and further improve models' predictive performance.
Besides, in terms of measuring the historical impacts, we revise the attentive models which summarize influence from historical events with an adaptive reweighting term considering events' type relation and time intervals.
Extensive experiments have been conducted to illustrate the improved predictive capability of GNTPP with a line of generative probabilistic decoders, and performance gain from the revised attention.
To the best of our knowledge, this is the first work that adapts generative models in a complete unified framework and studies their effectiveness in the context of TPP.

---

Title: A Comprehensive Study of Real-Time Object Detection Networks Across Multiple Domains: A Survey

Authors: Elahe Arani, Shruthi Gowda, Ratnajit Mukherjee, Omar Magdy, Senthilkumar Sockalingam Kathiresan, Bahram Zonooz

Abstract: Deep neural network based object detectors are continuously evolving and are used in a multitude of applications, each having its own set of requirements. While safety-critical applications need high accuracy and reliability, low-latency tasks need resource and energy-efficient networks. Real-time detection networks, which are a necessity in high-impact real-world applications, are continuously proposed but they overemphasize the improvements in accuracy and speed while other capabilities such as versatility, robustness, resource, and energy efficiency are omitted. A reference benchmark for existing networks does not exist nor does a standard evaluation guideline for designing new networks, which results in ambiguous and inconsistent comparisons. We, therefore, conduct a comprehensive study on multiple real-time detection networks (anchor-based, keypoint-based, and transformer-based) on a wide range of datasets and report results on an extensive set of metrics. We also study the impact of variables such as image size, anchor dimensions, confidence thresholds, and architecture layers on the overall performance. We analyze the robustness of detection networks against distribution shift, natural corruptions, and adversarial attacks. Also, we provide the calibration analysis to gauge the reliability of the predictions. Finally, to highlight the real-world impact, we conduct two unique case studies, on autonomous driving and healthcare application. To further gauge the capability of networks in critical real-time applications, we report the performance after deploying the detection networks on edge devices. Our extensive empirical study can act as a guideline for the industrial community to make an informed choice on the existing networks. We also hope to inspire the research community towards a new direction of design and evaluation of networks that focuses on the bigger and holistic overview for a far-reaching impact.

---

Title: Domain-invariant Feature Exploration for Domain Generalization

Authors: Wang Lu, Jindong Wang, Haoliang Li, Yiqiang Chen, Xing Xie

Abstract: Deep learning has achieved great success in the past few years. However, the performance of deep learning is likely to impede in face of non-IID situations. Domain generalization (DG) enables a model to generalize to an unseen test distribution, i.e., to learn domain-invariant representations. In this paper, we argue that domain-invariant features should be originating from both internal and mutual sides. Internal invariance means that the features can be learned with a single domain and the features capture intrinsic semantics of data, i.e., the property within a domain, which is agnostic to other domains. Mutual invariance means that the features can be learned with multiple domains (cross-domain) and the features contain common information, i.e., the transferable features w.r.t. other domains. We then propose DIFEX for Domain-Invariant Feature EXploration. DIFEX employs a knowledge distillation framework to capture the high-level Fourier phase as the internally-invariant features and learn cross-domain correlation alignment as the mutually-invariant features. We further design an exploration loss to increase the feature diversity for better generalization. Extensive experiments on both time-series and visual benchmarks demonstrate that the proposed DIFEX achieves state-of-the-art performance.

---

Title: Stable and Interpretable Unrolled Dictionary Learning

Authors: Bahareh Tolooshams, Demba E. Ba

Abstract: The dictionary learning problem, representing data as a combination of a few atoms, has long stood as a popular method for learning representations in statistics and signal processing. The most popular dictionary learning algorithm alternates between sparse coding and dictionary update steps, and a rich literature has studied its theoretical convergence. The success of dictionary learning relies on access to a good initial estimate of the dictionary and the ability of the sparse coding step to provide an unbiased estimate of the code. The growing popularity of unrolled sparse coding networks has led to the empirical finding that backpropagation through such networks performs dictionary learning. We offer the theoretical analysis of these empirical results through PUDLE, a Provable Unrolled Dictionary LEarning method. We provide conditions on the network initialization and data distribution sufficient to recover and preserve the support of the latent code. Additionally, we address two challenges; first, the vanilla unrolled sparse coding computes a biased code estimate, and second, gradients during backpropagated learning can become unstable. We show approaches to reduce the bias of the code estimate in the forward pass, and that of the dictionary estimate in the backward pass. We propose strategies to resolve the learning instability by tuning network parameters and modifying the loss function. Overall, we highlight the impact of loss, unrolling, and backpropagation on convergence. We complement our findings through synthetic and image denoising experiments. Finally, we demonstrate PUDLE's interpretability, a driving factor in designing deep networks based on iterative optimizations, by building a mathematical relation between network weights, its output, and the training set.

---

Title: Improving the Trainability of Deep Neural Networks through Layerwise Batch-Entropy Regularization

Authors: David Peer, Bart Keulen, Sebastian Stabinger, Justus Piater, Antonio Rodriguez-sanchez

Abstract: Training deep neural networks is a very demanding task, especially challenging is how to adapt architectures to improve the performance of trained models. We can find that sometimes, shallow networks generalize better than deep networks, and the addition of more layers results in higher training and test errors. The deep residual learning framework addresses this degradation problem by adding skip connections to several neural network layers. It would at first seem counter-intuitive that such skip connections are needed to train deep networks successfully as the expressivity of a network would grow exponentially with depth. In this paper, we first analyze the flow of information through neural networks. We introduce and evaluate the batch-entropy which quantifies the flow of information through each layer of a neural network. We prove empirically and theoretically that a positive batch-entropy is required for gradient descent-based training approaches to optimize a given loss function successfully. Based on those insights, we introduce batch-entropy regularization to enable gradient descent-based training algorithms to optimize the flow of information through each hidden layer individually. With batch-entropy regularization, gradient descent optimizers can transform untrainable networks into trainable networks. We show empirically that we can therefore train a "vanilla" fully connected network and convolutional neural network---no skip connections, batch normalization, dropout, or any other architectural tweak---with 500 layers by simply adding the batch-entropy regularization term to the loss function. The effect of batch-entropy regularization is not only evaluated on vanilla neural networks, but also on residual networks, autoencoders, and also transformer models over a wide range of computer vision as well as natural language processing tasks.

---

Title: Online Coresets for Parameteric and Non-Parametric Bregman Clustering

Authors: Supratim Shit, Anirban Dasgupta, Rachit Chhaya, Jayesh Choudhari

Abstract: We present algorithms that create coresets in an online setting for clustering problems based on a wide subset of Bregman divergences. Notably, our coresets have a small additive error, similar in magnitude to the gap between expected and empirical loss (Bachem et. al. 2017), and take update time $O(d)$ for every incoming point where $d$ is the dimension of the point. Our first algorithm gives online coresets of size $\tilde{O}(\mbox{poly}(k,d,\epsilon,\mu))$ for $k$-clusterings according to any $\mu$-similar Bregman divergence. We further extend this algorithm to show the existence of non-parametric coresets, where the coreset size is independent of $k$, the number of clusters, for the same subclass of Bregman divergences. Our non-parametric coresets also function as coresets for non-parametric versions of the Bregman clustering like DP-Means. While these coresets provide additive error guarantees, they are significantly smaller for high dimensional data than the (relative-error) coresets obtained in (Bachem et. al 2015) for DP-Means--- for the input of size $n$ our coresets grow as $O(\log n)$ while being independent of $d$ as opposed to $O(d^d)$ for points in $\~R^d$ (Bachem et. al 2015). We also present experiments to compare the performance of our algorithms with other sampling techniques.

---

Title: Max-Affine Spline Insights Into Deep Network Pruning

Authors: Haoran You, Randall Balestriero, Zhihan Lu, Yutong Kou, Huihong Shi, Shunyao Zhang, Shang Wu, Yingyan Lin, Richard Baraniuk

Abstract: State-of-the-art (SOTA) approaches to deep network (DN) training overparametrize the model and then prune a posteriori to obtain a winning ticket'' subnetwork that can achieve high accuracy. Using a recently developed spline interpretation of DNs, we obtain novel insights into how DN pruning affects its mapping. In particular, under the realm of spline operators, we are able to pinpoint the impact of pruning onto the DN's underlying input space partition and per-region affine mappings, opening new avenues in understanding why and when are pruned DNs able to maintain high performance. We also discover that a DN's spline mapping exhibits an early-bird (EB) phenomenon whereby the spline's partition converges at early training stages, bridging the recently developed DN spline theory and lottery ticket hypothesis of DNs. We finally leverage this new insight to develop a principled and efficient pruning strategy whose goal is to prune isolated groups of nodes that have a redundant contribution in the forming of the spline partition.
Extensive experiments on four networks and three datasets validate that our new spline-based DN pruning approach reduces training FLOPs by up to 3.5x while achieving similar or even better accuracy than current state-of-the-art methods. Code is available at https://github.com/RICE-EIC/Spline-EB.

---

Title: Did I do that? Blame as a means to identify controlled effects in reinforcement learning

Authors: Oriol Corcoll, Youssef Sherif Mansour Mohamed, Raul Vicente

Abstract: Affordance learning is a crucial ability of intelligent agents. This ability relies on understanding the different ways the environment can be controlled. Approaches encouraging RL agents to model controllable aspects of their environment have repeatedly achieved state-of-the-art results. Despite their success, these approaches have only been studied using generic tasks as a proxy but have not yet been evaluated in isolation. In this work, we study the problem of identifying controlled effects from a causal perspective. Humans compare counterfactual outcomes to assign a degree of blame to their actions. Following this idea, we propose Controlled Effect Network (CEN), a self-supervised method based on the causal concept of blame. CEN is evaluated in a wide range of environments against two state-of-the-art models, showing that it precisely identifies controlled effects.

---

New submissions
===============

Title: Medium Samples Are All You Need: Sample Easiness-Aware Learning for Neural Networks

Abstract: Deep learning models benefit from training on a vast collection of samples. However, training on more samples is not necessarily equivalent to higher performance in terms of accuracy and speed. Learning with organized training materials, by emphasizing on easy or hard examples, is shown to achieve better performance in certain scenarios. While there is no standard strategy for easiness evaluation, a loss metric is frequently used as an indicator. In this work, we propose a unified sample easiness estimator to quantify the level of easiness of a model on a sample. We further propose a novel loss function, named Sample Easiness-based Loss (SEL), which regularizes class probabilities to be better used in sample easiness estimates. SEL can be easily applied to any neural network architecture without any modification. We then provide a novel neural network training strategy, sample easiness-based training (SET), to offer a choice of training with designated sample easiness, e.g., medium easiness, to reduce the training time significantly. Results show that our SET utilizes only 0.06%--11.18% of training samples while achieving similar or higher test accuracies. In addition, we demonstrate that our sample easiness framework is helpful in mislabeled data identification task.

---

Title: Transferable and Adaptable Driving Behavior Prediction

Abstract: While autonomous vehicles still struggle to solve challenging situations during on-road driving, humans have long mastered the essence of driving with efficient, transferable, and adaptable driving capability. The obvious gap between humans and autonomous vehicles keeps us wondering about the essence of how human learns to drive. Inspired by humans' cognition model and semantic understanding during driving in a hierarchical learning procedure, we propose HATN, a hierarchical framework to generate high-quality, transferable, and adaptable predictions for driving behaviors in multi-agent dense-traffic environments. Our hierarchical method consists of a high-level intention identification policy and a low-level trajectory generation policy. We introduce a novel semantic definition for the two policies and generic state representation for each policy, so that the hierarchical framework is transferable across different driving scenarios. Besides, our model is able to capture variations of driving behaviors among individuals and scenarios by an online adaptation module. We demonstrate our algorithms in the task of trajectory prediction for real traffic data at intersections and roundabouts from the INTERACTION dataset. Through extensive numerical studies, it is evident that our method significantly outperformed other methods in terms of prediction accuracy, transferability, and adaptability. Pushing the performance by a considerable margin, we also provide a cognitive view of understanding the driving behavior behind such improvement. We highlight that in the future, more research attention and effort are deserved for the transferability and adaptability of autonomous driving planning and prediction algorithms. It is not only due to the promising performance elevation, but more fundamentally, they are crucial for the scalable and general deployment of autonomous vehicles.

---

Title: Neural Fixed-Point Acceleration for Second-order Cone Optimization Problems

Abstract:
Continuous fixed-point problems are a computational primitive in numerical computing, optimization, machine learning, and the natural and social sciences, and have recently been incorporated into deep learning models as optimization layers. Acceleration of fixed-point computations has traditionally been explored in optimization research without the use of learning. In this work, we introduce neural fixed-point acceleration, a framework to automatically learn to accelerate fixed-point problems that are drawn from a distribution; a key question motivating our work is to better understand the characteristics that make neural acceleration more beneficial for some problems than others. We apply the framework to solve second-order cone programs with the Splitting Conic Solver (SCS), and evaluate on distributions of Lasso problems and Kalman filtering problems. Our main results show that we are able to get a 10× performance improvement in accuracy on the Kalman filtering distribution, while those on Lasso are much more modest. We then isolate a few factors that make neural acceleration much more useful on the Kalman filtering distribution than on the Lasso distribution; we apply a number of problem and distribution modifications on a scaled-down version of the Lasso problem, adding in properties that make it structurally closer to Kalman filtering, and show when the problem benefits from neural acceleration.

---

Title: Systematically and efficiently improving existing $k$-means initialization algorithms by pairwise-nearest-neighbor smoothing

Abstract: We present a meta-method for initializing (seeding) the $k$-means clustering algorithm called PNN-smoothing. It consists in splitting
a given dataset into $J$ random subsets, clustering each of them individually, and merging the resulting clusterings with the pairwise-nearest-neighbor (PNN) method. It is a meta-method in the sense that when clustering the individual subsets any seeding algorithm can be used. If the computational complexity of that seeding algorithm is linear in the size of the data $N$ and the number of clusters $k$, PNN-smoothing is also almost linear with an appropriate choice of $J$, and quite competitive in practice. We show empirically, using several existing seeding methods and testing on several synthetic and real datasets, that this procedure results in systematically better costs. Our implementation is publicly available at [ANONYMIZED - ATTACHED AS AUXILIARY MATERIAL FOR THE REVIEW].

---

Title: On Noise Abduction for Answering Counterfactual Queries: A Practical Outlook

Abstract: A crucial step in counterfactual inference is abduction - inference of the exogenous noise variables. Deep Learning approaches model an exogenous noise variable as a latent variable. Our ability to infer a latent variable comes at a computational cost as well as a statistical cost. In this paper, we show that it may not be necessary to abduct all the noise variables in a structural causal model (SCM) to answer a counterfactual query. In a fully specified causal model with no unobserved confounding, we also identify exogenous noises that must be abducted for a counterfactual query. We introduce a graphical condition for noise identification from an action consisting of an arbitrary combination of hard and soft interventions. We report experimental results on both synthetic and real-world German Credit Dataset showcasing the promise and usefulness of the proposed exogenous noise identification.

---

Title: Multi-Source Causal Inference Using Control Variates under Outcome Selection Bias

Abstract: While many areas of machine learning have benefited from the increasing availability of large and varied datasets, the benefit to causal inference has been limited given the strong assumptions needed to ensure the identifiability of causal effects -- which are often not satisfied in real-world datasets. For example, many large observational datasets (e.g., case-control studies in epidemiology, click-through data in recommender systems) suffer from selection bias on the outcome, which makes the average treatment effect (ATE) non-identifiable. We propose an algorithm to estimate causal effects from multiple data sources, where the ATE may be identifiable only in some datasets but not others. The idea is to construct control variates across the datasets in which the ATE may not be identifiable, which provably reduces the variance of the ATE estimate. We focus on a setting where the observational datasets suffer from outcome selection bias, assuming access to an auxiliary small dataset from which we can obtain a consistent estimate of the ATE. We propose a construction of control variate by taking the difference of the conditional odds ratio estimates from the two datasets. Across simulations and two case studies with real data, we show that the control variate-based ATE estimator has consistently and significantly reduced variance against different baselines.

---

Title: An Efficient One-Class SVM for Novelty Detection in IoT

Abstract: One-Class Support Vector Machines (OCSVM) are a state-of-the-art approach for novelty detection, due to their flexibility in fitting complex nonlinear boundaries between {normal} and {novel} data. Novelty detection is important in the Internet of Things (`IoT'') due to the threats these devices can present, and OCSVM often performs well in these environments due to the variety of devices, traffic patterns, and anomalies that IoT devices present. Unfortunately, conventional OCSVMs can introduce prohibitive memory and computational overhead at detection time. This work designs, implements and evaluates an efficient OCSVM for such practical settings. We extend Nystr\"om and (Gaussian) Sketching approaches to OCSVM, combining these methods with clustering and Gaussian mixture models to achieve 15-30x speedup in prediction time and 30-40x reduction in memory requirements, without sacrificing detection accuracy. Here, the very nature of IoT devices is crucial: they tend to admit few modes of \emph{normal} operation, allowing for efficient pattern compression.

---

Title: Measuring and mitigating interference in reinforcement learning

Abstract: Catastrophic interference is common in many network-based learning systems, and many proposals exist for mitigating it. Before overcoming interference we must understand it better. In this work, we provide a definition and novel measure of interference for value-based reinforcement learning methods such as Fitted Q-Iteration and DQN. We systematically evaluate our measure of interference, showing that it correlates with instability in control performance, across a variety of network architectures. Our new interference measure allows us to ask novel scientific questions about commonly used deep learning architectures and study learning algorithms which mitigate interference. Lastly, we outline a class of algorithms which we call online-aware that are designed to mitigate interference, and show they do reduce interference according to our measure and that they improve stability and performance in several classic control environments.

---

Title: Unsupervised Learning of Neurosymbolic Encoders

Abstract: We present a framework for the unsupervised learning of neurosymbolic encoders, which are encoders obtained by composing neural networks with symbolic programs from a domain-specific language. Our framework naturally incorporates symbolic expert knowledge into the learning process, which leads to more interpretable and factorized latent representations compared to fully neural encoders. We integrate modern program synthesis techniques with the variational autoencoding (VAE) framework, in order to learn a neurosymbolic encoder in conjunction with a standard decoder. The programmatic descriptions from our encoders can benefit many analysis workflows, such as in behavior modeling where interpreting agent actions and movements is important. We evaluate our method on learning latent representations for real-world trajectory data from animal biology and sports analytics. We show that our approach offers significantly better separation than standard VAEs and leads to practical gains on downstream analysis tasks, such as for behavior classification.

---

Title: Policy Optimization with Distributional Constraints: An Optimal Transport View

Abstract: We consider constrained policy optimization in Reinforcement Learning, where the constraints are in form of marginals on state visitations and global action executions. Given these distributions, we formulate policy optimization as unbalanced optimal transport over the space of occupancy measures. We propose a general purpose RL objective based on Bregman divergence and optimize it using Dykstra’s algorithm. The approach admits an actor-critic algorithm for when the state or action space is large, and only samples from the marginals are available. We discuss applications of our approach and provide demonstrations to show the effectiveness of our algorithm.

---

Title: Towards Bridging the gap between Empirical and Certified Robustness against Adversarial Examples

Abstract: The current state-of-the-art defense methods against adversarial examples typically focus on improving either empirical or certified robustness. Among them, adversarially trained (AT) models produce empirical state-of-the-art defense against adversarial examples without providing any robustness guarantees for large classifiers or higher-dimensional inputs. In contrast, existing randomized smoothing based models achieve state-of-the-art certified robustness while significantly degrading the empirical robustness against adversarial examples.
In this paper, we propose a novel method, called \emph{Certification through Adaptation}, that transforms an AT model into a randomized smoothing classifier during inference to provide certified robustness for $\ell_2$ norm without affecting their empirical robustness against adversarial attacks. We also propose \emph{Auto-Noise} technique that efficiently approximates the appropriate noise levels to flexibly certify the test examples using randomized smoothing technique. Our proposed \emph{Certification through Adaptation} with \emph{Auto-Noise} technique achieves an \textit{average certified radius (ACR) scores} up to $1.102$ and $1.148$ respectively for CIFAR-10 and ImageNet datasets using AT models without affecting their empirical robustness or benign accuracy. Therefore, our paper is a step towards bridging the gap between the empirical and certified robustness against adversarial examples by achieving both using the same classifier.

---

Title: Variational Auto-encoders for Causal Inference

Abstract: Estimating causal effects from observational data (at either an individual- or a population- level) is critical for making many types of decisions. One approach to address this task is to learn decomposed representations of the underlying factors of data; this becomes significantly more challenging when there are confounding factors (which influence both the cause and the effect). In this paper, we take a generative approach that builds on the recent advances in Variational Auto-Encoders to simultaneously learn those underlying factors as well as the causal effects. We propose a progressive sequence of models, where each improves over the previous one, culminating in the Hybrid model. Our empirical results demonstrate that the performance of the proposed hybrid models are superior to both state-of-the-art discriminative as well as other generative approaches in the literature.

---

Title: Approximate Policy Iteration with Bisimulation Metrics

Abstract: Bisimulation metrics define a distance measure between states of a Markov decision process (MDP) based on a comparison of reward sequences. Due to this property they provide theoretical guarantees in value function approximation (VFA). In this work we first prove that bisimulation and $\pi$-bisimulation metrics can be defined via a more general class of Sinkhorn distances, which unifies various state similarity metrics used in recent work. Then we describe an approximate policy iteration (API) procedure that uses a bisimulation-based discretization of the state space for VFA and prove asymptotic performance bounds. Next, we bound the difference between $\pi$-bisimulation metrics in terms of the change in the policies themselves. Based on these results, we design an API($\alpha$) procedure that employs conservative policy updates and enjoys better performance bounds than the naive API approach. We discuss how such API procedures map onto practical actor-critic methods that use bisimulation metrics for state representation learning. Lastly, we validate our theoretical results and investigate their practical implications via a controlled empirical analysis based on an implementation of bisimulation-based API for finite MDPs.

---

Title: Revisiting the Noise Model of Stochastic Gradient Descent

Abstract: The stochastic gradient noise (SGN) is known as a significant factor in the success of stochastic gradient descent (SGD).
Following the central limit theorem, SGN was initially modeled as Gaussian, and lately, it has been suggested that stochastic gradient noise is better characterized using $S\alpha S$ Lévy distribution.
This claim was allegedly refuted and rebounded to the previously suggested Gaussian noise model.
This paper presents solid and detailed empirical evidence that SGN is heavy-tailed and better depicted by the $S\alpha S$ distribution. Furthermore, we argue that different parameters in a deep neural network (DNN) hold distinct SGN characteristics throughout training. To more accurately approximate the dynamics of SGD near a local minimum, we construct a novel framework in $\mathbb{R}^N$, based on Lévy-driven stochastic differential equation (SDE), where one-dimensional Lévy processes model each parameter in the DNN. Next, we study the effect of learning rate decay (LRdecay) on the training process. We demonstrate theoretically and empirically that its main optimization advantage stems from the reduction of the SGN. Based on our analysis, we examine the mean escape time, trapping probability, and more properties of DNNs near local minima. Finally, we prove that the training process is more likely to exit from the basin in the direction of parameters with heavier tail SGN. We will share our code for reproducibility.

---

Title: Ensemble Policy Optimization with Diversity Regularization

Abstract: In machine learning tasks, ensemble methods have been widely adopted to boost the performance by aggregating multiple learning models. However, ensemble methods are much less explored in the task of reinforcement learning, where most of previous works only combine multiple value estimators or dynamics models and use a mixed policy to explore the environment. In this work, we propose a simple yet effective ensemble policy optimization method to improve the joint performance of the policy ensemble. This method utilizes a policy ensemble where heterogeneous policies explore the environment collectively and their diversity is maintained by the proposed diversity regularization mechanism. We evaluate the proposed method on continuous control tasks and find that by aggregating the learned policies into an ensemble policy in test time, the performance is greatly improved. \revision{DEPO has performance improvement and faster convergence over the base on-policy single-agent method it built upon.} Code will be made publicly available.

---

Title: Local Kernel Ridge Regression for Scalable, Interpolating, Continuous Regression

Abstract: We study a localized version of kernel ridge regression that can continuously, smoothly interpolate the underlying function values which are highly non-linear with observed data points. This new method can deal with the data of which (a) local density is highly uneven and (b) the function values change dramatically in certain small but unknown regions. By introducing a new rank-based interpolation scheme, the interpolated values provided by our local method continuously vary with query points. Our method is scalable by avoiding the full matrix inverse, compared with traditional kernel ridge regression.

---