Weekly TMLR digest for Oct 30, 2022

TMLR

Oct 29, 2022, 8:00:10 PM
to tmlr-annou...@googlegroups.com

Accepted papers
===============


Title: Benchmarking Progress to Infant-Level Physical Reasoning in AI

Authors: Luca Weihs, Amanda Yuile, Renée Baillargeon, Cynthia Fisher, Gary Marcus, Roozbeh Mottaghi, Aniruddha Kembhavi

Abstract: To what extent do modern AI systems comprehend the physical world? We introduce the open-access Infant-Level Physical Reasoning Benchmark (InfLevel) to gain insight into this question. We evaluate ten neural-network architectures developed for video understanding on tasks designed to test these models' ability to reason about three essential physical principles which researchers have shown to guide human infants' physical understanding. We explore the sensitivity of each AI system to the continuity of objects as they travel through space and time, to the solidity of objects, and to gravity. We find strikingly consistent results across 60 experiments with multiple systems, training regimes, and evaluation metrics: current popular visual-understanding systems are at or near chance on all three principles of physical reasoning. We close by suggesting some potential ways forward.

URL: https://openreview.net/forum?id=9NjqD9i48M

---

Title: Explicit Group Sparse Projection with Applications to Deep Learning and NMF

Authors: Riyasat Ohib, Nicolas Gillis, Niccolo Dalmasso, Sameena Shah, Vamsi K. Potluru, Sergey Plis

Abstract: We design a new sparse projection method for a set of vectors that guarantees a desired average sparsity level measured with the popular Hoyer measure (an affine function of the ratio of the $\ell_1$ and $\ell_2$ norms). Existing approaches either project each vector individually or require the use of a regularization parameter which implicitly maps to the average $\ell_0$-measure of sparsity. Instead, in our approach we set the sparsity level for the whole set explicitly and simultaneously project a group of vectors, with the sparsity level of each vector tuned automatically. We show that the computational complexity of our projection operator is linear in the size of the problem. Additionally, we propose a generalization of this projection by replacing the $\ell_1$ norm by its weighted version. We showcase the efficacy of our approach in both supervised and unsupervised learning tasks on image datasets including CIFAR10 and ImageNet. In deep neural network pruning, the sparse models produced by our method on ResNet50 have significantly higher accuracies at corresponding sparsity values compared to existing competitors. In nonnegative matrix factorization, our approach yields competitive reconstruction errors against state-of-the-art algorithms.
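For readers less familiar with the Hoyer measure referenced above, here is a minimal sketch (not the authors' projection operator) of how the average sparsity of a group of vectors, the quantity the projection controls, can be computed:

    import numpy as np

    def hoyer_sparsity(x, eps=1e-12):
        """Hoyer (2004) sparsity: an affine function of ||x||_1 / ||x||_2.
        Returns 0 for equal-magnitude entries and 1 for a 1-sparse vector."""
        x = np.asarray(x, dtype=float)
        n = x.size
        ratio = np.abs(x).sum() / (np.linalg.norm(x) + eps)
        return (np.sqrt(n) - ratio) / (np.sqrt(n) - 1.0)

    # Average sparsity over a group of vectors.
    vectors = [np.random.randn(256) for _ in range(8)]
    print(np.mean([hoyer_sparsity(v) for v in vectors]))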

URL: https://openreview.net/forum?id=jIrOeWjdpc

---

Title: Linear algebra with transformers

Authors: Francois Charton

Abstract: Transformers can learn to perform numerical computations from examples only. I study nine problems of linear algebra, from basic matrix operations to eigenvalue decomposition and inversion, and introduce and discuss four encoding schemes to represent real numbers.
On all problems, transformers trained on sets of random matrices achieve high accuracies (over 90\%). The models are robust to noise, and can generalize out of their training distribution. In particular, models trained to predict Laplace-distributed eigenvalues generalize to different classes of matrices: Wigner matrices or matrices with positive eigenvalues. The reverse is not true.
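To make the idea of encoding real numbers as token sequences concrete, below is a hypothetical sketch of one base-10 encoding (sign token, rounded mantissa, exponent token); it is an assumption for illustration and not necessarily any of the paper's four schemes:

    import math

    def encode_float(x, mantissa_digits=3):
        """Encode a real number as [sign, mantissa digits, exponent] tokens."""
        if x == 0:
            return ["+", "0" * mantissa_digits, "E0"]
        sign = "+" if x > 0 else "-"
        exp = math.floor(math.log10(abs(x)))
        mantissa = round(abs(x) / 10 ** exp * 10 ** (mantissa_digits - 1))
        return [sign, str(mantissa), f"E{exp - (mantissa_digits - 1)}"]

    print(encode_float(-3.14159))   # ['-', '314', 'E-2'], i.e. -314 * 10^-2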


URL: https://openreview.net/forum?id=Hp4g7FAXXG

---

Title: INR-V: A Continuous Representation Space for Video-based Generative Tasks

Authors: Bipasha Sen, Aditya Agarwal, Vinay P Namboodiri, C.V. Jawahar

Abstract: Generating videos is a complex task that is accomplished by generating a set of temporally coherent images frame-by-frame. This limits the expressivity of videos to only image-based operations on the individual video frames, requiring network designs to obtain temporally coherent trajectories in the underlying image space. We propose INR-V, a video representation network that learns a continuous space for video-based generative tasks. INR-V parameterizes videos using implicit neural representations (INRs), a multi-layer perceptron that predicts an RGB value for each input pixel location of the video. The INR is predicted using a meta-network which is a hypernetwork trained on neural representations of multiple video instances. Later, the meta-network can be sampled to generate diverse novel videos enabling many downstream video-based generative tasks. Interestingly, we find that conditional regularization and progressive weight initialization play a crucial role in obtaining INR-V. The representation space learned by INR-V is more expressive than an image space, showcasing many interesting properties not possible with existing works. For instance, INR-V can smoothly interpolate intermediate videos between known video instances (such as intermediate identities, expressions, and poses in face videos). It can also inpaint missing portions in videos to recover temporally coherent full videos. In this work, we evaluate the space learned by INR-V on diverse generative tasks such as video interpolation, novel video generation, video inversion, and video inpainting against the existing baselines. INR-V significantly outperforms the baselines on several of these demonstrated tasks, clearly showing the potential of the proposed representation space.
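As a rough illustration of the kind of implicit neural representation described above, a minimal sketch of a coordinate MLP mapping a normalized (x, y, t) location to an RGB value; all layer sizes are assumptions, and the hypernetwork that predicts the INR's weights is not shown:

    import torch
    import torch.nn as nn

    class VideoINR(nn.Module):
        """MLP from a normalized (x, y, t) coordinate to an RGB value."""
        def __init__(self, hidden=256, layers=4):
            super().__init__()
            dims = [3] + [hidden] * layers + [3]
            blocks = []
            for d_in, d_out in zip(dims[:-1], dims[1:]):
                blocks += [nn.Linear(d_in, d_out), nn.ReLU()]
            blocks[-1] = nn.Sigmoid()          # RGB in [0, 1]
            self.net = nn.Sequential(*blocks)

        def forward(self, coords):             # coords: (N, 3) in [-1, 1]
            return self.net(coords)

    inr = VideoINR()
    xyt = torch.rand(1024, 3) * 2 - 1
    rgb = inr(xyt)                              # (1024, 3)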

URL: https://openreview.net/forum?id=aIoEkwc2oB

---

Title: A Simple Convergence Proof of Adam and Adagrad

Authors: Alexandre Défossez, Leon Bottou, Francis Bach, Nicolas Usunier

Abstract: We provide a simple proof of convergence covering both the Adam and Adagrad adaptive optimization algorithms when applied to smooth (possibly non-convex) objective functions with bounded gradients. We show that in expectation, the squared norm of the objective gradient averaged over the trajectory has an upper-bound which is explicit in the constants of the problem, parameters of the optimizer, the dimension $d$, and the total number of iterations $N$. This bound can be made arbitrarily small, and with the right hyper-parameters, Adam can be shown to converge with the same rate of convergence $O(d\ln(N)/\sqrt{N})$. When used with the default parameters, Adam doesn't converge, however, and just like constant step-size SGD, it moves away from the initialization point faster than Adagrad, which might explain its practical success. Finally, we obtain the tightest dependency on the heavy ball momentum decay rate $\beta_1$ among all previous convergence bounds for non-convex Adam and Adagrad, improving from $O((1-\beta_1)^{-3})$ to $O((1-\beta_1)^{-1})$.

URL: https://openreview.net/forum?id=ZPQhzTSWA7

---

Title: On the Paradox of Certified Training

Authors: Nikola Jovanović, Mislav Balunovic, Maximilian Baader, Martin Vechev

Abstract: Certified defenses based on convex relaxations are an established technique for training provably robust models. The key component is the choice of relaxation, varying from simple intervals to tight polyhedra. Counterintuitively, loose interval-based training often leads to higher certified robustness than what can be achieved with tighter relaxations, which is a well-known but poorly understood paradox. While recent works introduced various improvements aiming to circumvent this issue in practice, the fundamental problem of training models with high certified robustness remains unsolved. In this work, we investigate the underlying reasons behind the paradox and identify two key properties of relaxations, beyond tightness, that impact certified training dynamics: continuity and sensitivity. Our extensive experimental evaluation with a number of popular convex relaxations provides strong evidence that these factors can explain the drop in certified robustness observed for tighter relaxations. We also systematically explore modifications of existing relaxations and discover that improving unfavorable properties is challenging, as such attempts often harm other properties, revealing a complex tradeoff. Our findings represent an important first step towards understanding the intricate optimization challenges involved in certified training.
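For context, the "simple intervals" relaxation mentioned above propagates axis-aligned boxes through the network. A minimal sketch of the affine-layer step of interval bound propagation (an illustration of the relaxation, not the paper's contribution):

    import numpy as np

    def interval_linear(W, b, lower, upper):
        """Element-wise lower/upper bounds on W x + b for x in [lower, upper]."""
        center = (lower + upper) / 2.0
        radius = (upper - lower) / 2.0
        out_center = W @ center + b
        out_radius = np.abs(W) @ radius
        return out_center - out_radius, out_center + out_radius

    W = np.random.randn(4, 3); b = np.zeros(4)
    l, u = interval_linear(W, b, np.zeros(3), np.ones(3) * 0.1)
    # A ReLU is then relaxed element-wise: [max(l, 0), max(u, 0)].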

URL: https://openreview.net/forum?id=atJHLVyBi8

---

Title: Time Series Alignment with Global Invariances

Authors: Titouan Vayer, Romain Tavenard, Laetitia Chapel, Rémi Flamary, Nicolas Courty, Yann Soullard

Abstract: Multivariate time series are ubiquitous objects in signal processing. Measuring a distance or similarity between two such objects is of prime interest in a variety of applications, including machine learning, but can be very difficult as soon as the temporal dynamics and the representation of the time series, i.e. the nature of the observed quantities, differ from one another. In this work, we propose a novel distance accounting for both feature space and temporal variabilities by learning a latent global transformation of the feature space together with a temporal alignment, cast as a joint optimization problem. The versatility of our framework allows for several variants depending on the invariance class at stake. Among other contributions, we define a differentiable loss for time series and present two algorithms for the computation of time series barycenters under this new geometry. We illustrate the benefits of our approach on both simulated and real-world data and show its robustness compared to state-of-the-art methods.


URL: https://openreview.net/forum?id=JXCH5N4Ujy

---

Title: Reasonable Effectiveness of Random Weighting: A Litmus Test for Multi-Task Learning

Authors: Baijiong Lin, Feiyang YE, Yu Zhang, Ivor Tsang

Abstract: Multi-Task Learning (MTL) has achieved success in various fields. However, training with equal weights for all tasks may cause unsatisfactory performance for some tasks. To address this problem, many works carefully design dynamic loss/gradient weighting strategies, but the basic random-weighting experiments needed to examine their effectiveness are ignored. In this paper, we propose the Random Weighting (RW) methods, including Random Loss Weighting (RLW) and Random Gradient Weighting (RGW), where an MTL model is trained with random loss/gradient weights sampled from a distribution. To show the effectiveness and necessity of RW methods, theoretically, we analyze the convergence of RW and reveal that RW has a higher probability of escaping local minima, resulting in better generalization ability. Empirically, we extensively evaluate the proposed RW methods against twelve state-of-the-art methods on five image datasets and two multilingual problems from the XTREME benchmark, showing that RW methods can achieve performance comparable to state-of-the-art baselines. Therefore, we think the RW methods are important baselines for MTL and should attract more attention.
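As an illustration, one simple instantiation of random loss weighting is sketched below; the softmax-normalized Gaussian weights are an assumption here, since the paper studies several sampling distributions:

    import numpy as np

    def random_loss_weights(num_tasks, rng):
        """Draw loss weights from a standard normal and softmax-normalize them."""
        z = rng.standard_normal(num_tasks)
        w = np.exp(z - z.max())
        return w / w.sum()

    rng = np.random.default_rng(0)
    task_losses = np.array([0.7, 1.2, 0.4])            # per-task losses at this step
    total_loss = random_loss_weights(3, rng) @ task_losses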

URL: https://openreview.net/forum?id=jjtFD8A1Wx

---

Title: Direct Molecular Conformation Generation

Authors: Jinhua Zhu, Yingce Xia, Chang Liu, Lijun Wu, Shufang Xie, Yusong Wang, Tong Wang, Tao Qin, Wengang Zhou, Houqiang Li, Haiguang Liu, Tie-Yan Liu

Abstract: Molecular conformation generation aims to generate three-dimensional coordinates of all the atoms in a molecule and is an important task in bioinformatics and pharmacology. Previous methods usually first predict the interatomic distances, the gradients of interatomic distances or the local structures (e.g., torsion angles) of a molecule, and then reconstruct its 3D conformation. How to directly generate the conformation without the above intermediate values is not fully explored. In this work, we propose a method that directly predicts the coordinates of atoms: (1) the loss function is invariant to roto-translation of coordinates and permutation of symmetric atoms; (2) the newly proposed model adaptively aggregates the bond and atom information and iteratively refines the coordinates of the generated conformation. Our method achieves the best results on the GEOM-QM9 and GEOM-Drugs datasets. Further analysis shows that our generated conformations have properties (e.g., the HOMO-LUMO gap) closer to those of the ground-truth conformations. In addition, our method improves molecular docking by providing better initial conformations. All the results demonstrate the effectiveness of our method and the great potential of the direct approach. The code is released at \url{https://github.com/DirectMolecularConfGen/DMCG}.
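To illustrate what a roto-translation-invariant coordinate loss can look like, here is a minimal sketch based on Kabsch alignment followed by RMSD; this is an illustration only, and the paper's loss additionally handles permutations of symmetric atoms:

    import numpy as np

    def aligned_rmsd(pred, target):
        """RMSD after optimal rigid alignment (Kabsch), so translations and
        rotations of the predicted conformation do not affect the loss."""
        pred = pred - pred.mean(axis=0)
        target = target - target.mean(axis=0)
        H = pred.T @ target
        U, _, Vt = np.linalg.svd(H)
        d = np.sign(np.linalg.det(Vt.T @ U.T))        # avoid reflections
        R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
        return np.sqrt(np.mean(np.sum((pred @ R.T - target) ** 2, axis=1)))

    atoms = np.random.randn(20, 3)
    print(aligned_rmsd(atoms, atoms + 5.0))   # ~0: the translation is factored out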

URL: https://openreview.net/forum?id=lCPOHiztuw

---

Title: Symbolic Regression is NP-hard

Authors: Marco Virgolin, Solon P Pissis

Abstract: Symbolic regression (SR) is the task of learning a model of data in the form of a mathematical expression. By their nature, SR models have the potential to be accurate and human-interpretable at the same time. Unfortunately, finding such models, i.e., performing SR, appears to be a computationally intensive task. Historically, SR has been tackled with heuristics such as greedy or genetic algorithms and, while some works have hinted at the possible hardness of SR, no proof has yet been given that SR is, in fact, NP-hard. This raises the question: is there an exact polynomial-time algorithm to compute SR models? We provide evidence suggesting that the answer is probably negative by showing that SR is NP-hard.

URL: https://openreview.net/forum?id=LTiaPxqe2e

---

Title: Differentially Private Stochastic Expectation Propagation

Authors: Margarita Vinaroz, Mijung Park

Abstract: We are interested in privatizing an approximate posterior inference algorithm called Expectation Propagation (EP). EP approximates the posterior distribution by iteratively refining approximations to the local likelihood terms. By doing so, EP typically provides better posterior uncertainties than variational inference (VI), which globally approximates the likelihood term. However, EP needs a large amount of memory to maintain all local approximations associated with each datapoint in the training data. To overcome this challenge, stochastic expectation propagation (SEP) considers a single unique local factor that captures the average effect of each likelihood term on the posterior and refines it in a way analogous to EP. In terms of privatization, SEP is more tractable than EP because, at each factor's refining step, the remaining factors are fixed and, unlike in EP, do not depend on other datapoints. This independence makes the sensitivity analysis straightforward. We provide a theoretical analysis of the privacy-accuracy trade-off in the posterior distributions under our method, which we call differentially private stochastic expectation propagation (DP-SEP). Furthermore, we test the DP-SEP algorithm on both synthetic and real-world datasets and evaluate the quality of posterior estimates at different levels of guaranteed privacy.

URL: https://openreview.net/forum?id=e5ILb2Nqst

---

Title: Practicality of generalization guarantees for unsupervised domain adaptation with neural networks

Authors: Adam Breitholtz, Fredrik Daniel Johansson

Abstract: Understanding generalization is crucial to confidently engineer and deploy machine learning models, especially when deployment implies a shift in the data domain. For such domain adaptation problems, we seek generalization bounds which are tractably computable and tight. If these desiderata can be reached, the bounds can serve as guarantees for adequate performance in deployment. However, in applications where deep neural networks are the models of choice, deriving results which fulfill these desiderata remains an unresolved challenge; most existing bounds are either vacuous or have non-estimable terms, even in favorable conditions. In this work, we evaluate existing bounds from the literature with potential to satisfy our desiderata on domain adaptation image classification tasks, where deep neural networks are preferred. We find that all bounds are vacuous and that sample generalization terms account for much of the observed looseness, especially when these terms interact with measures of domain shift. To overcome this and arrive at the tightest possible results, we combine each bound with recent data-dependent PAC-Bayes analysis, greatly improving the guarantees. We find that, when domain overlap can be assumed, a simple importance weighting extension of previous work provides the tightest estimable bound. Finally, we study which terms dominate the bounds and identify possible directions for further improvement.

URL: https://openreview.net/forum?id=vUuHPRrWs2

---

Title: On Noise Abduction for Answering Counterfactual Queries: A Practical Outlook

Authors: Saptarshi Saha, Utpal Garain

Abstract: A crucial step in counterfactual inference is abduction - inference of the exogenous noise variables. Deep learning approaches model an exogenous noise variable as a latent variable. Our ability to infer a latent variable comes at a computational cost as well as a statistical cost. In this paper, we show that it may not be necessary to abduct all the noise variables in a structural causal model (SCM) to answer a counterfactual query. In a fully specified causal model with no unobserved confounding, we also identify the exogenous noise variables that must be abducted for a counterfactual query. We introduce a graphical condition for noise identification from an action consisting of an arbitrary combination of hard and soft interventions. We report experimental results on both synthetic data and the real-world German Credit dataset, showcasing the promise and usefulness of the proposed exogenous noise identification.
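As a reminder of what the abduction step does, here is a hypothetical toy example of the standard abduction-action-prediction recipe on a two-variable SCM; the numbers and mechanism are assumptions for illustration, not taken from the paper:

    # Toy linear SCM:  X := U_X,   Y := 2*X + U_Y
    x_obs, y_obs = 1.0, 3.5

    # Abduction: infer the exogenous noise consistent with the observation.
    u_y = y_obs - 2.0 * x_obs          # = 1.5

    # Action: intervene do(X = 2). Prediction: recompute Y under the same noise.
    x_cf = 2.0
    y_cf = 2.0 * x_cf + u_y            # counterfactual Y = 5.5
    print(y_cf)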

URL: https://openreview.net/forum?id=4FU8Jz1Oyj

---

Title: Failure Detection in Medical Image Classification: A Reality Check and Benchmarking Testbed

Authors: Mélanie Bernhardt, Fabio De Sousa Ribeiro, Ben Glocker

Abstract: Failure detection in automated image classification is a critical safeguard for clinical deployment. Detected failure cases can be referred to human assessment, ensuring patient safety in computer-aided clinical decision making. Despite its paramount importance, there is insufficient evidence about the ability of state-of-the-art confidence scoring methods to detect test-time failures of classification models in the context of medical imaging. This paper provides a reality check, establishing the performance of in-domain misclassification detection methods, benchmarking 9 widely used confidence scores on 6 medical imaging datasets with different imaging modalities, in multiclass and binary classification settings. Our experiments show that the problem of failure detection is far from being solved. We found that none of the benchmarked advanced methods proposed in the computer vision and machine learning literature can consistently outperform a simple softmax baseline, demonstrating that improved out-of-distribution detection or model calibration do not necessarily translate to improved in-domain misclassification detection. Our developed testbed facilitates future work in this important area.
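The "simple softmax baseline" that the benchmarked methods fail to consistently beat can be sketched as follows; the review threshold below is a hypothetical value, not one from the paper:

    import numpy as np

    def max_softmax_confidence(logits):
        """Maximum predicted class probability, used as a confidence score;
        low-confidence cases are flagged for human review."""
        z = logits - logits.max(axis=1, keepdims=True)
        probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
        return probs.max(axis=1)

    logits = np.array([[4.0, 0.5, 0.1], [1.1, 1.0, 0.9]])
    conf = max_softmax_confidence(logits)
    flag_for_review = conf < 0.7        # application-specific threshold (assumed)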

URL: https://openreview.net/forum?id=VBHuLfnOMf

---


New submissions
===============


Title: ViViT: Curvature Access Through The Generalized Gauss-Newton’s Low-Rank Structure

Abstract: Curvature in the form of the Hessian or its generalized Gauss-Newton (GGN) approximation is valuable for algorithms that rely on a local model of the loss to train, compress, or explain deep networks. Existing methods based on implicit multiplication via automatic differentiation or Kronecker-factored block diagonal approximations do not consider noise in the mini-batch. We present ViViT, a curvature model that leverages the GGN’s low-rank structure without further approximations. It allows for efficient computation of eigenvalues and eigenvectors, as well as per-sample first- and second-order directional derivatives. The representation is computed in parallel with gradients in one backward pass and offers a fine-grained cost-accuracy trade-off, which allows it to scale. We demonstrate this by conducting performance benchmarks and substantiate ViViT’s usefulness by studying the impact of noise on the GGN’s structural properties during neural network training.

URL: https://openreview.net/forum?id=DzJ7JfPXkE

---

Title: Learning active learning on imbalanced classes with Attentive Neural Processes

Abstract: Pool-based active learning (AL) is a promising technology for increasing the data-efficiency of machine learning models. However, surveys show that the performance of recent AL methods is very sensitive to the choice of dataset and training setting, making them unsuitable for general application. Additionally, most AL developments for classification models focus on settings that enjoy a balanced class distribution, while real-life data is often heavily imbalanced. We extend the aforementioned survey results to imbalanced data settings and find that current AL underperforms when data is imbalanced. Additionally, we propose a novel Learning Active Learning (LAL) method that exploits symmetry and independence properties of the active learning problem with an Attentive Conditional Neural Process model. Our approach is based on learning from a myopic oracle, which we observe to provide a strong signal especially in imbalanced data settings. In those settings, our model outperforms a variety of baselines and shows a tendency towards improved stability to changing datasets. However, performance is sensitive to the choice of classifier and more work is necessary to reduce the performance gap with the myopic oracle and to improve scalability. We present our work as a proof-of-concept for LAL on imbalanced data settings and hope our analysis and modelling considerations inspire future LAL work.

URL: https://openreview.net/forum?id=aDh1Q2TfTV

---

Title: Proportional Fairness in Federated Learning

Abstract: With the increasingly broad deployment of federated learning (FL) systems in the real world, it is critical but challenging to ensure fairness in FL, i.e. reasonably satisfactory performances for each of the numerous diverse clients. In this work, we introduce and study a new fairness notion in FL, called proportional fairness (PF), which is based on the relative change of each client's performance. From its connection with bargaining games, we propose PropFair, a novel and easy-to-implement algorithm for finding proportionally fair solutions in FL, and study its convergence properties. Through extensive experiments on vision and language datasets, we demonstrate that PropFair can approximately find PF solutions, and it achieves a good balance between the average performances of all clients and of the worst 10% of clients.

URL: https://openreview.net/forum?id=ryUHgEdWCQ

---

Title: GSR: A Generalized Symbolic Regression Approach

Abstract: Identifying the mathematical relationships that best describe a dataset remains a very challenging problem in machine learning, and is known as Symbolic Regression (SR). In contrast to neural networks which are often treated as black boxes, SR attempts to gain insight into the underlying relationships between the independent variables and the target variable of a given dataset by assembling analytical functions. In this paper, we present GSR, a Generalized Symbolic Regression approach, by modifying the conventional SR optimization problem formulation, while keeping the main SR objective intact. In GSR, we infer mathematical relationships between the independent variables and some transformation of the target variable. We constrain our search space to a weighted sum of basis functions, and propose a genetic programming approach with a matrix-based encoding scheme. We show that our GSR method outperforms several state-of-the-art methods on the well-known SR benchmark problem sets. Finally, we highlight the strengths of GSR by introducing SymSet, a new SR benchmark set which is more challenging relative to the existing benchmarks.
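To make the "weighted sum of basis functions" search space concrete, here is a hypothetical sketch of the inner fitting problem once a candidate basis has been fixed; the basis search itself, which GSR performs with genetic programming, is not shown, and the target function below is invented for illustration:

    import numpy as np

    # Fit the weights of a fixed set of basis functions by least squares.
    x = np.linspace(0.1, 3.0, 50)
    y = 2.0 * np.sin(x) + 0.5 * np.log(x)                 # hypothetical ground truth

    basis = np.column_stack([np.sin(x), np.cos(x), np.log(x), x, np.ones_like(x)])
    weights, *_ = np.linalg.lstsq(basis, y, rcond=None)
    print(np.round(weights, 3))    # recovers approximately [2, 0, 0.5, 0, 0]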

URL: https://openreview.net/forum?id=lheUXtDNvP

---

Title: Stacking Diverse Architectures to Improve Machine Translation

Abstract: Repeated applications of the same neural block, primarily based on self-attention, characterize the current state-of-the-art in neural architectures for machine translation. In such architectures the decoder adopts a masked version of the same encoding block. Although simple, this strategy doesn't encode the various inductive biases, such as locality, that arise from alternative architectures and that are central to the modelling of translation. We propose Lasagna, an encoder-decoder model that aims to combine the inductive benefits of different architectures by layering multiple instances of different blocks. Lasagna’s encoder first grows the representation from local to mid-sized using convolutional blocks and only then applies a pair of final self-attention blocks. Lasagna’s decoder uses only convolutional blocks that attend to the encoder representation. On a large suite of machine translation tasks, we find that Lasagna not only matches or outperforms the Transformer baseline, but it does so more efficiently thanks to widespread use of the efficient convolutional blocks. These findings suggest that the widespread use of uniform architectures may be suboptimal in certain scenarios and that exploiting the diversity of inductive architectural biases can lead to substantial gains.


URL: https://openreview.net/forum?id=mNEqiC924B

---

Title: Attention as Inference via Fenchel Duality

Abstract: Attention has been widely adopted in many state-of-the-art deep learning models. While the significant performance improvements it brings have attracted great interest, attention is still poorly understood theoretically. This paper presents a new perspective to understand attention by showing that it can be seen as a solver of a family of estimation problems. In particular, we describe a convex optimization problem that arises in a family of estimation tasks commonly appearing in the design of deep learning models. Rather than directly solving the convex optimization problem, we solve its Fenchel dual and derive a closed-form approximation of the optimal solution. Remarkably, the solution gives a generalized attention structure, and its special case is equivalent to the popular dot-product attention adopted in transformer networks. We show that the T5 transformer has implicitly adopted the general form of the solution by demonstrating that this expression unifies the word mask and the positional encoding functions. Finally, we discuss how the proposed attention structures can be integrated in practical models and empirically show that the convex optimization problem indeed provides a principled justification of the attention module design.
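For reference, the standard scaled dot-product attention that the paper recovers as a special case of its dual solution:

    import numpy as np

    def dot_product_attention(Q, K, V):
        """softmax(Q K^T / sqrt(d)) V, the usual transformer attention."""
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)
        scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V

    Q = np.random.randn(5, 16); K = np.random.randn(7, 16); V = np.random.randn(7, 32)
    out = dot_product_attention(Q, K, V)                    # shape (5, 32)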

URL: https://openreview.net/forum?id=XtL7cM4fQy

---

Title: Detecting Anomalies within Time Series using Local Neural Transformations

Abstract: We develop a new method to detect anomalies within time series, which is essential in many application domains, ranging from self-driving cars, finance, and marketing to medical diagnosis and epidemiology. The method is based on self-supervised deep learning that has played a key role in facilitating deep anomaly detection on images, where powerful image transformations are available. However, such transformations are widely unavailable for time series. Addressing this, we develop Local Neural Transformations (LNT), a method learning local transformations of time series from data. The method produces an anomaly score for each time step and thus can be used to detect anomalies within time series. We prove in a theoretical analysis that our novel training objective is more suitable for transformation learning than previous deep anomaly detection (AD) methods. Our experiments demonstrate that LNT can find anomalies in speech segments from the LibriSpeech data set and better detect interruptions to cyber-physical systems than previous work. Visualization of the learned transformations gives insight into the type of transformations that LNT learns.

URL: https://openreview.net/forum?id=p6xslUyvka

---

Title: Direct Neural Network Training on Securely Encoded Datasets

Abstract: In fields where data privacy and secrecy are critical, such as healthcare and business intelligence, security concerns have limited data availability for neural network training. A recently developed technique securely encodes training, test, and inference examples with an aggregate non-orthogonal and nonlinear transformation that consists of steps of padding, perturbation, and orthogonal transformation, enabling artificial neural network (ANN) training and inference directly on encoded datasets. Here, the performance characteristics of the various aspects of the method are presented. The individual transformations of the method, when applied alone, do not significantly reduce validation accuracy. Training on datasets transformed by sequential padding, perturbation, and orthogonal transformation results in only slightly lower validation accuracies than those seen with unmodified control datasets (relative decreases in accuracy of 0.15% to 0.35%), with no difference in training time seen between transformed and control datasets. The presented methods have broad implications for machine learning in fields requiring data security.
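As a rough illustration of the orthogonal-transformation step alone (the padding and perturbation steps described above are not reproduced, and none of this is the paper's exact encoding):

    import numpy as np

    def random_orthogonal_encode(X, seed=0):
        """Apply one secret random orthogonal matrix to every flattened example."""
        rng = np.random.default_rng(seed)
        d = X.shape[1]
        Q, _ = np.linalg.qr(rng.standard_normal((d, d)))    # random orthogonal matrix
        return X @ Q

    X = np.random.randn(100, 64)                             # 100 flattened examples
    X_encoded = random_orthogonal_encode(X)
    # Pairwise Euclidean distances are preserved, so much of the learning problem
    # survives the encoding while the raw features are hidden.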

URL: https://openreview.net/forum?id=pP0ABGeLe9

---

Title: Measuring Interventional Robustness in Reinforcement Learning

Abstract: Recent work in reinforcement learning has focused on several characteristics of learned policies that go beyond maximizing reward. These properties include fairness, explainability, generalization, and robustness. In this paper, we define interventional robustness (IR), a measure of how much variability is introduced into learned policies by incidental aspects of the training procedure, such as the order of training data or the particular exploratory actions taken by agents. A training procedure has high IR when the agents it produces take very similar actions under intervention, despite variation in these incidental aspects of the training procedure. We develop an intuitive, quantitative measure of IR and calculate it for eight algorithms in three Atari environments across dozens of interventions and states. From these experiments, we find that IR varies with the amount of training and type of algorithm and that high performance does not imply high IR, as one might expect.

URL: https://openreview.net/forum?id=B9jqhnukny

---

Title: Reinforcement Teaching

Abstract: Machine learning algorithms learn to solve a task, but are unable to improve their ability to learn.
Meta-learning methods learn about machine learning algorithms and improve them so that they learn more quickly. However, existing meta-learning methods are either hand-crafted to improve one specific component of an algorithm or only work with differentiable algorithms.
We develop a unifying meta-learning framework, called \textit{Reinforcement Teaching}, to improve the learning process of \emph{any} algorithm. Under Reinforcement Teaching, a teaching policy is learned, through reinforcement, to improve a student's learning algorithm. To learn an effective teaching policy, we introduce the \textit{parametric-behavior embedder} that learns a representation of the student's learnable parameters from its input/output behavior. We further use \textit{learning progress} to shape the teacher's reward, allowing it to more quickly maximize the student's performance. To demonstrate the generality of Reinforcement Teaching, we conduct experiments in which a teacher learns to significantly improve both reinforcement and supervised learning algorithms. Reinforcement Teaching outperforms previous work using heuristic reward functions and state representations, as well as other parameter representations.

URL: https://openreview.net/forum?id=G2GKiicaJI

---

Title: Learn, Unlearn and Relearn: An Online Learning Paradigm for Deep Neural Networks

Abstract: Deep neural networks (DNNs) are often trained with the premise that the complete training data set is provided ahead of time. However, in real-world scenarios, data often arrive in chunks over time. This leads to important considerations about the optimal strategy for training DNNs, such as whether to fine-tune them with each chunk of incoming data (warm-start) or to retrain them from scratch with the entire corpus of data whenever a new chunk is available. While employing the latter for training can be computationally inefficient, recent work has pointed out the lack of generalization in warm-start models. Therefore, to strike a balance between efficiency and generalization, we introduce \textit{Learn, Unlearn, and Relearn (LURE)}, an online learning paradigm for DNNs. LURE alternates between the unlearning phase, which selectively forgets the undesirable information in the model through weight reinitialization in a data-dependent manner, and the relearning phase, which emphasizes learning on generalizable features. We show that our training paradigm provides consistent performance gains across datasets in both classification and few-shot settings. We further show that it leads to more robust and well-calibrated models.

URL: https://openreview.net/forum?id=WN1O2MJDST

---

Title: The Low-Rank Simplicity Bias in Deep Networks

Abstract: Modern deep neural networks are highly over-parameterized compared to the data on which they are trained, yet they often generalize remarkably well. A flurry of recent work has asked: why do deep networks not overfit to their training data? In this work, we make a series of empirical observations that investigate and extend the hypothesis that deeper networks are inductively biased to find solutions with lower effective rank embeddings. We conjecture that this bias exists because the volume of functions that map to low effective rank embeddings increases with depth. We show empirically that our claim holds true for finite-width linear and non-linear models on practical learning paradigms and show that, on natural data, these are often the solutions that generalize well. We then show that the simplicity bias exists at both initialization and after training and is resilient to hyper-parameters and learning methods. We further demonstrate how linear over-parameterization of deep non-linear models can be used to induce low-rank bias, improving generalization performance on CIFAR and ImageNet without changing the modeling capacity.
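To make the notion of a low effective rank embedding concrete, one common definition of effective rank (Roy & Vetterli, 2007) is sketched below; the paper may use a different rank measure, so treat this as an illustrative assumption:

    import numpy as np

    def effective_rank(embeddings, eps=1e-12):
        """Exponential of the entropy of the normalized singular-value distribution."""
        s = np.linalg.svd(embeddings, compute_uv=False)
        p = s / (s.sum() + eps)
        return float(np.exp(-(p * np.log(p + eps)).sum()))

    Z = np.random.randn(512, 64) @ np.random.randn(64, 256)   # rank <= 64 by construction
    print(effective_rank(Z))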

URL: https://openreview.net/forum?id=bCiNWDmlY2

---

Title: Bayesian Causal Bandits with Backdoor Adjustment Prior

Abstract: The causal bandit problem setting is a sequential decision-making framework where actions of interest correspond to interventions on variables in a system assumed to be governed by a causal model. The underlying causality may be exploited when investigating actions in the interest of optimizing the yield of the reward variable. Most existing approaches assume prior knowledge of the underlying causal graph, which is in practice restrictive and often unrealistic. In this paper, we develop a novel Bayesian framework for tackling causal bandit problems that does not rely on possession of the causal graph, but rather simultaneously learns the causal graph while exploiting causal inferences to optimize the reward. Our methods efficiently utilize joint inferences from interventional and observational data in a unified Bayesian model constructed with intervention calculus and causal graph learning. For the implementation of our proposed methodology in the discrete distributional setting, we derive an approximation of the sampling variance of the backdoor adjustment estimator. In the Gaussian setting, we characterize the interventional variance with intervention calculus and propose a simple graphical criterion to share information between arms. We validate our proposed methodology in an extensive empirical study, demonstrating compelling cumulative regret performance against state-of-the-art standard algorithms as well as optimistic implementations of their causal variants that assume strong prior knowledge of the causal structure.
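For readers unfamiliar with the backdoor adjustment estimator whose sampling variance the paper approximates, here is a hypothetical sketch of the plain estimator for a single discrete adjustment variable; the synthetic data and column names are assumptions for illustration, and the paper's Bayesian treatment is not reproduced:

    import numpy as np
    import pandas as pd

    def backdoor_adjustment(df, x_val, y_val):
        """Estimate P(Y = y | do(X = x)) = sum_z P(Y = y | X = x, Z = z) P(Z = z)."""
        est = 0.0
        for z_val, pz in df["Z"].value_counts(normalize=True).items():
            cond = df[(df["X"] == x_val) & (df["Z"] == z_val)]
            if len(cond) > 0:
                est += pz * (cond["Y"] == y_val).mean()
        return est

    rng = np.random.default_rng(0)
    z = rng.integers(0, 2, 5000)
    x = (rng.random(5000) < 0.3 + 0.4 * z).astype(int)    # Z confounds X
    y = (rng.random(5000) < 0.2 + 0.5 * x).astype(int)    # true P(Y=1|do(X=1)) = 0.7
    df = pd.DataFrame({"X": x, "Y": y, "Z": z})
    print(backdoor_adjustment(df, x_val=1, y_val=1))      # ~0.7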

URL: https://openreview.net/forum?id=sMsGv5Kfm3

---

Title: Semantically Adversarial Scene Generation with Explicit Knowledge Guidance for Autonomous Driving

Abstract: Generating adversarial scenes that potentially fail autonomous driving systems provides an effective way to improve their robustness. Extending purely data-driven generative models, recent specialized models satisfy additional controllable requirements, such as embedding a traffic sign in a driving scene, by manipulating patterns implicitly at the neuron level. In this paper, we introduce a method to incorporate domain knowledge explicitly in the generation process to achieve Semantically Adversarial Generation (SAG). To be consistent with the composition of driving scenes, we first categorize the knowledge into two types: the properties of objects and the relationships among objects. We propose a tree-structured variational auto-encoder (T-VAE) to learn hierarchical scene representations. By imposing semantic rules on the properties of the nodes and edges of the tree structure, explicit knowledge integration enables controllable generation. To demonstrate the advantage of structural representation, we construct a synthetic example to illustrate the controllability and explainability of our method in a succinct setting. We further extend to realistic environments for autonomous vehicles, showing that our method efficiently identifies adversarial driving scenes against different state-of-the-art 3D point cloud segmentation models and satisfies the traffic rules specified as explicit knowledge.

URL: https://openreview.net/forum?id=d96WMHRg1i

---

Title: Evaluating the Evaluators: Which UDA validation methods are most effective? Can they be improved?

Abstract: This paper compares and ranks 8 UDA validation methods. Validators estimate model accuracy, which makes them an essential component of any UDA train-test pipeline. We rank these validators to indicate which of them are most useful for the purpose of selecting optimal model checkpoints and hyperparameters. To the best of our knowledge, this large-scale benchmark study is the first of its kind in the UDA field. In addition, we propose 3 new validators that outperform existing validators. When paired with one particular UDA algorithm, one of our new validators achieves state-of-the-art performance.

URL: https://openreview.net/forum?id=1oJp1R9PSJ

---

Title: Recognition Models to Learn Dynamics from Partial Observations with Neural ODEs

Abstract: Identifying dynamical systems from experimental data is a notably difficult task. Prior knowledge generally helps, but the extent of this knowledge varies with the application, and customized models are often needed. Neural ordinary differential equations offer a flexible framework for system identification and can incorporate a broad spectrum of physical insight, giving physical interpretability to the resulting latent space. In the case of partial observations, however, the data points cannot directly be mapped to the latent state of the ODE. Hence, we propose to design recognition models, in particular inspired by nonlinear observer theory, to link the partial observations to the latent state. We demonstrate the performance of the proposed approach on numerical simulations and on an experimental dataset from a robotic exoskeleton.

URL: https://openreview.net/forum?id=LTAdaRM29K

---

Title: Computationally-efficient initialisation of GPs: The generalised variogram method

Abstract: We present a computationally efficient strategy to find the hyperparameters of a Gaussian process (GP) avoiding the computation of the likelihood function. The found hyperparameters can then be used directly for regression or passed as initial conditions to maximum-likelihood (ML) training. Motivated by the fact that training a GP via ML is equivalent (on average) to minimising the KL-divergence between the true and learnt model, we set out to explore different metrics/divergences among GPs that are computationally inexpensive and provide estimates close to those of ML. In particular, we identify the GP hyperparameters by matching the empirical covariance to a parametric candidate, proposing and studying various measures of discrepancy. Our proposal extends the Variogram method developed in the geostatistics literature and thus is referred to as the Generalised Variogram method (GVM). In addition to the theoretical presentation of GVM, we provide experimental validation in terms of accuracy, consistency with ML and computational complexity for different kernels using synthetic and real-world data.
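A rough sketch of the covariance-matching idea on a regular 1-D grid: estimate the empirical covariance at each lag and fit a parametric kernel to it by least squares. GVM studies several discrepancy measures, so squared error here is only one choice, and all kernels and numbers below are assumptions for illustration:

    import numpy as np
    from scipy.optimize import curve_fit

    def rbf(lag, variance, lengthscale):
        return variance * np.exp(-0.5 * (lag / lengthscale) ** 2)

    rng = np.random.default_rng(0)
    n, true_var, true_ls = 500, 1.0, 8.0
    t = np.arange(n)
    K = rbf(np.abs(t[:, None] - t[None, :]), true_var, true_ls)
    x = np.linalg.cholesky(K + 1e-6 * np.eye(n)) @ rng.standard_normal(n)   # GP sample path

    lags = np.arange(30)
    emp_cov = np.array([np.mean(x[: n - l] * x[l:]) for l in lags])
    (variance, lengthscale), _ = curve_fit(rbf, lags, emp_cov, p0=[emp_cov[0], 5.0])
    # 'variance' and 'lengthscale' can now initialise (or replace) ML training of the GP,
    # without ever evaluating the GP likelihood.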

URL: https://openreview.net/forum?id=slsAQHpS7n

---
