Weekly TMLR digest for Sep 25, 2022


TMLR

Sep 24, 2022, 8:00:13 PM
to tmlr-annou...@googlegroups.com

Accepted papers
===============


Title: Faking Interpolation Until You Make It

Authors: Alasdair Paren, Rudra P. K. Poudel, M. Pawan Kumar

Abstract: Deep over-parameterized neural networks exhibit the interpolation property on many data sets. Specifically, these models can achieve approximately zero loss on all training samples simultaneously. This property has been exploited to develop optimisation algorithms for this setting. These algorithms exploit the fact that the optimal loss value is known and employ a variation of the Polyak step size calculated on each stochastic batch of data. We introduce a novel extension of this idea to tasks where the interpolation property does not hold. As we no longer have access to the optimal loss values a priori, we instead estimate them for each sample online. To realise this, we introduce a simple but highly effective heuristic for approximating the optimal value based on previous loss evaluations. We provide rigorous experimentation on a range of problems, and our empirical analysis demonstrates the effectiveness of our approach, which outperforms other single-hyperparameter optimisation methods.

URL: https://openreview.net/forum?id=OslAMMF4ZP
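
For readers who want a concrete feel for a Polyak-style update with an estimated optimum, here is a minimal, hypothetical PyTorch sketch. The running-minimum heuristic and all names below are illustrative assumptions, not the authors' actual algorithm.

import torch

def polyak_like_step(params, batch_loss, f_star_estimate, max_lr=1.0, eps=1e-8):
    """One Polyak-style update: step = (loss - estimated optimum) / ||grad||^2, clipped."""
    grads = torch.autograd.grad(batch_loss, params)
    grad_sq = sum((g ** 2).sum() for g in grads)
    step = torch.clamp((batch_loss.detach() - f_star_estimate) / (grad_sq + eps),
                       min=0.0, max=max_lr)
    with torch.no_grad():
        for p, g in zip(params, grads):
            p.add_(g, alpha=-step.item())

# A simple (hypothetical) online estimate of the optimal loss per sample:
# keep the smallest loss observed so far for each training example.
def update_f_star(f_star, sample_ids, sample_losses):
    for i, l in zip(sample_ids, sample_losses):
        f_star[i] = min(f_star.get(i, float("inf")), float(l))
    return f_star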

---

Title: Ensembles of Classifiers: a Bias-Variance Perspective

Authors: Neha Gupta, Jamie Smith, Ben Adlam, Zelda E Mariet

Abstract: Ensembles are a straightforward, remarkably effective method for improving the accuracy, calibration, and robustness of neural networks on classification tasks. Yet, the reasons underlying their success remain an active area of research. Building upon (Pfau, 2013), we turn to the bias-variance decomposition of Bregman divergences in order to gain insight into the behavior of ensembles under classification losses. Introducing a dual reparameterization of the bias-variance decomposition, we first derive generalized laws of total expectation and variance, then discuss how bias and variance terms can be estimated empirically. Next, we show that the dual reparameterization naturally introduces a way of constructing ensembles which reduces the variance and leaves the bias unchanged. Conversely, we show that ensembles that directly average model outputs can arbitrarily increase or decrease the bias. Empirically, we see that such ensembles of neural networks may reduce the bias. We conclude with an empirical analysis of ensembles over neural network architecture hyperparameters, revealing that these techniques allow for more efficient bias reduction than standard ensembles.

URL: https://openreview.net/forum?id=lIOQFVncY9
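
To make the dual reparameterization concrete, a toy numpy illustration of the difference between averaging classifier outputs in probability space and averaging them in the dual (log-probability) space of the log loss. The numbers are made up; this conveys only the general idea, not the paper's construction.

import numpy as np

# Two softmax classifiers' predicted distributions for one input (illustrative values)
p1 = np.array([0.7, 0.2, 0.1])
p2 = np.array([0.3, 0.5, 0.2])

# Output-space ensemble: average probabilities directly
primal = (p1 + p2) / 2

# Dual-space ensemble for the log loss: average log-probabilities, then renormalise
log_avg = (np.log(p1) + np.log(p2)) / 2
dual = np.exp(log_avg) / np.exp(log_avg).sum()

print(primal, dual)  # the two ensembles generally differ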

---

Title: GFNet: Geometric Flow Network for 3D Point Cloud Semantic Segmentation

Authors: Haibo Qiu, Baosheng Yu, Dacheng Tao

Abstract: Point cloud semantic segmentation from projected views, such as range-view (RV) and bird's-eye-view (BEV), has been intensively investigated. Different views capture different information of point clouds and thus are complementary to each other. However, recent projection-based methods for point cloud semantic segmentation usually utilize a vanilla late fusion strategy for the predictions of different views, failing to explore the complementary information from a geometric perspective during the representation learning. In this paper, we introduce a geometric flow network (GFNet) to explore the geometric correspondence between different views in an align-before-fuse manner. Specifically, we devise a novel geometric flow module (GFM) to bidirectionally align and propagate the complementary information across different views according to geometric relationships under the end-to-end learning scheme. We perform extensive experiments on two widely used benchmark datasets, SemanticKITTI and nuScenes, to demonstrate the effectiveness of our GFNet for projection-based point cloud semantic segmentation. Concretely, GFNet not only significantly boosts the performance of each individual view but also achieves state-of-the-art results over all existing projection-based models. Code is available at \url{https://github.com/haibo-qiu/GFNet}.

URL: https://openreview.net/forum?id=LSAAlS7Yts

---

Title: Deep Learning for Bayesian Optimization of Scientific Problems with High-Dimensional Structure

Authors: Samuel Kim, Peter Y Lu, Charlotte Loh, Jamie Smith, Jasper Snoek, Marin Soljacic

Abstract: Bayesian optimization (BO) is a popular paradigm for global optimization of expensive black-box functions, but there are many domains where the function is not completely a black-box. The data may have some known structure (e.g.\ symmetries) and/or the data generation process may be a composite process that yields useful intermediate or auxiliary information in addition to the value of the optimization objective. However, surrogate models traditionally employed in BO, such as Gaussian Processes (GPs), scale poorly with dataset size and do not easily accommodate known structure. Instead, we use Bayesian neural networks, a class of scalable and flexible surrogate models with inductive biases, to extend BO to complex, structured problems with high dimensionality. We demonstrate BO on a number of realistic problems in physics and chemistry, including topology optimization of photonic crystal materials using convolutional neural networks, and chemical property optimization of molecules using graph neural networks. On these complex tasks, we show that neural networks often outperform GPs as surrogate models for BO in terms of both sampling efficiency and computational cost.

URL: https://openreview.net/forum?id=tPMQ6Je2rB
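
A minimal sketch of the overall loop, using a small ensemble of scikit-learn MLPs as a stand-in surrogate with an expected-improvement acquisition. The paper uses Bayesian neural networks and structured inputs; the toy objective and all hyperparameters here are illustrative.

import numpy as np
from scipy.stats import norm
from sklearn.neural_network import MLPRegressor

def objective(x):                       # toy 1-D stand-in for an expensive black-box simulation
    return np.sin(3 * x) + 0.5 * x

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(5, 1))
y = objective(X).ravel()
candidates = np.linspace(-2, 2, 200).reshape(-1, 1)

for _ in range(10):
    # Small ensemble of MLPs as a crude approximation to a Bayesian NN surrogate
    preds = []
    for seed in range(5):
        m = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=seed)
        m.fit(X, y)
        preds.append(m.predict(candidates))
    mu, sigma = np.mean(preds, axis=0), np.std(preds, axis=0) + 1e-9
    # Expected improvement acquisition (maximisation)
    best = y.max()
    z = (mu - best) / sigma
    ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)
    x_next = candidates[np.argmax(ei)].reshape(1, -1)
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next).ravel())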

---

Title: Evolving Decomposed Plasticity Rules for Information-Bottlenecked Meta-Learning

Authors: Fan Wang, Hao Tian, Haoyi Xiong, Hua Wu, Jie Fu, Yang Cao, Kang Yu, Haifeng Wang

Abstract: Artificial neural networks (ANNs) are typically confined to accomplishing pre-defined tasks by learning a set of static parameters. In contrast, biological neural networks (BNNs) can adapt to various new tasks by continually updating the neural connections based on the inputs, which is aligned with the paradigm of learning effective learning rules in addition to static parameters, \textit{e.g.}, meta-learning. Among various biologically inspired learning rules, Hebbian plasticity updates the neural network weights using local signals without the guide of an explicit target function, thus enabling an agent to learn automatically without human effort. However, typical plastic ANNs using a large amount of meta-parameters violate the nature of the genomics bottleneck and potentially deteriorate the generalization capacity. This work proposes a new learning paradigm decomposing those connection-dependent plasticity rules into neuron-dependent rules, thus accommodating $\Theta(n^2)$ learnable parameters with only $\Theta(n)$ meta-parameters. We also thoroughly study the effect of different forms of neural modulation on plasticity. Our algorithms are tested in challenging random 2D maze environments, where the agents have to use their past experiences to shape the neural connections and improve their performance in the future. The results of our experiments validate the following: 1. Plasticity can be adopted to continually update a randomly initialized RNN to surpass pre-trained, more sophisticated recurrent models, especially when it comes to long-term memorization. 2. Following the genomics bottleneck, the proposed decomposed plasticity can be comparable to or even more effective than canonical plasticity rules in some instances.

URL: https://openreview.net/forum?id=6qMKztPn0n
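
For intuition about the parameter-count argument, a tiny numpy sketch contrasting connection-wise Hebbian modulation ($\Theta(n^2)$ meta-parameters) with a neuron-wise, rank-one decomposition ($\Theta(n)$ meta-parameters). This is an illustrative simplification, not the authors' exact rule or modulation scheme.

import numpy as np

n_pre, n_post = 64, 64
rng = np.random.default_rng(0)

# Connection-dependent Hebbian rule: one coefficient per synapse (Theta(n^2))
A_full = rng.normal(size=(n_post, n_pre))

# Decomposed rule: one coefficient per neuron (Theta(n)), combined as a rank-1 outer product
a_post = rng.normal(size=(n_post, 1))
b_pre = rng.normal(size=(1, n_pre))

W = rng.normal(scale=0.1, size=(n_post, n_pre))
pre = rng.normal(size=(n_pre,))
post = np.tanh(W @ pre)

eta = 0.01
hebb = np.outer(post, pre)                       # local pre/post activity product
dW_full = eta * A_full * hebb                    # per-connection modulation
dW_decomposed = eta * (a_post @ b_pre) * hebb    # neuron-wise modulation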

---

Title: Multitask Online Mirror Descent

Authors: Nicolò Cesa-Bianchi, Pierre Laforgue, Andrea Paudice, Massimiliano Pontil

Abstract: We introduce and analyze MT-OMD, a multitask generalization of Online Mirror Descent (OMD) which operates by sharing updates between tasks. We prove that the regret of MT-OMD is of order $\sqrt{1 + \sigma^2(N-1)}\sqrt{T}$, where $\sigma^2$ is the task variance according to the geometry induced by the regularizer, $N$ is the number of tasks, and $T$ is the time horizon. Whenever tasks are similar, that is $\sigma^2 \le 1$, our method improves upon the $\sqrt{NT}$ bound obtained by running independent OMDs on each task. We further provide a matching lower bound, and show that our multitask extensions of Online Gradient Descent and Exponentiated Gradient, two major instances of OMD, enjoy closed-form updates, making them easy to use in practice. Finally, we present experiments which support our theoretical findings.

URL: https://openreview.net/forum?id=zwRX9kkKzj
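
A rough sketch of the flavor of update sharing, specialised to the Euclidean (online gradient descent) instance: each task mixes its own gradient with the across-task average. This is only an illustration of the idea of sharing updates; the actual MT-OMD update is defined through a multitask regularizer and differs in its details.

import numpy as np

def mt_ogd(task_targets, T=200, eta=0.1, lam=0.5):
    """Illustrative multitask online gradient descent on toy losses 0.5*||x_i - u_i||^2.
    lam controls how much of the update is shared across tasks (hypothetical scheme)."""
    N, d = task_targets.shape
    X = np.zeros((N, d))
    for t in range(T):
        G = X - task_targets                      # one gradient per task
        shared = G.mean(axis=0, keepdims=True)    # averaged update shared by all tasks
        X -= eta * ((1 - lam) * G + lam * shared)
    return X

# Similar tasks (small variance across targets) are where sharing helps most
targets = np.array([[1.0, 0.0], [1.1, -0.1], [0.9, 0.1]])
print(mt_ogd(targets))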

---

Title: Approximating 1-Wasserstein Distance with Trees

Authors: Makoto Yamada, Yuki Takezawa, Ryoma Sato, Han Bao, Zornitsa Kozareva, Sujith Ravi

Abstract: The Wasserstein distance, which measures the discrepancy between distributions, shows efficacy in various types of natural language processing and computer vision applications. One of the challenges in estimating the Wasserstein distance is that it is computationally expensive and does not scale well for many distribution-comparison tasks. In this study, we aim to approximate the 1-Wasserstein distance by the tree-Wasserstein distance (TWD), where the TWD is a 1-Wasserstein distance with tree-based embedding that can be computed in linear time with respect to the number of nodes on a tree. More specifically, we propose a simple yet efficient L1-regularized approach for learning the weights of edges in a tree. To this end, we first demonstrate that the 1-Wasserstein approximation problem can be formulated as a distance approximation problem using the shortest path distance on a tree. We then show that the shortest path distance can be represented by a linear model and formulated as a Lasso-based regression problem. Owing to the convex formulation, we can efficiently obtain a globally optimal solution. We also propose a tree-sliced variant of these methods. Through experiments, we demonstrate that the TWD can accurately approximate the original 1-Wasserstein distance by using the weight estimation technique. Our code can be found in the GitHub repository.

URL: https://openreview.net/forum?id=Ig82l87ZVU
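
For reference, the closed form behind the linear-time claim: on a tree, the 1-Wasserstein distance is a weighted sum over edges of the absolute difference in subtree mass. A small numpy sketch follows; the tree, weights, and distributions are made up, and the paper's contribution of learning the edge weights is not shown here.

import numpy as np

def tree_wasserstein(parent, edge_weight, mu, nu):
    """Tree-Wasserstein distance: sum over edges of weight * |mu mass below edge - nu mass below edge|.
    Nodes are assumed ordered so that parent[v] < v for every non-root node (root = 0)."""
    n = len(parent)
    diff = np.asarray(mu, dtype=float) - np.asarray(nu, dtype=float)
    total = 0.0
    # Process nodes leaves-first, pushing subtree mass differences up to each parent
    for v in range(n - 1, 0, -1):
        total += edge_weight[v] * abs(diff[v])
        diff[parent[v]] += diff[v]
    return total

# Toy tree: node 0 is the root; parent and edge_weight are indexed by child node
parent = [-1, 0, 0, 1, 1]
edge_weight = [0.0, 1.0, 2.0, 0.5, 0.5]
mu = [0.0, 0.0, 0.5, 0.25, 0.25]
nu = [0.0, 0.5, 0.5, 0.0, 0.0]
print(tree_wasserstein(parent, edge_weight, mu, nu))   # 0.25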

---


New submissions
===============


Title: Better Theory for SGD in the Nonconvex World

Abstract: Large-scale nonconvex optimization problems are ubiquitous in modern machine learning, and among practitioners interested in solving them, Stochastic Gradient Descent (SGD) reigns supreme. We revisit the analysis of SGD in the nonconvex setting and propose a new variant of the recently introduced \emph{expected smoothness} assumption which governs the behavior of the second moment of the stochastic gradient. We show that our assumption is both more general and more reasonable than assumptions made in all prior work. Moreover, our results yield the optimal $\mathcal{O}(\epsilon^{-4})$ rate for finding a stationary point of nonconvex smooth functions, and recover the optimal $\mathcal{O}(\epsilon^{-1})$ rate for finding a global solution if the Polyak-Łojasiewicz condition is satisfied. We compare against convergence rates under convexity and prove a theorem on the convergence of SGD under Quadratic Functional Growth and convexity, which might be of independent interest. Moreover, we perform our analysis in a framework which allows for a detailed study of the effects of a wide array of sampling strategies and minibatch sizes for finite-sum optimization problems. We corroborate our theoretical results with experiments on real and synthetic data.

URL: https://openreview.net/forum?id=AU4qHN2VkS
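
For context, expected smoothness conditions of this kind are typically stated along the following lines (the exact constants and form used in this submission may differ):

$\mathbb{E}_{\xi}\big[\|\nabla f_{\xi}(x)\|^2\big] \le 2A\,\big(f(x) - f^{\inf}\big) + B\,\|\nabla f(x)\|^2 + C \quad \text{for all } x,$

with constants $A, B, C \ge 0$ and $f^{\inf}$ a lower bound on $f$.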

---

Title: Attention Beats Concatenation for Conditioning Neural Fields

Abstract: Neural fields model signals by mapping coordinate inputs to sampled values. They are becoming an increasingly important backbone architecture across many fields from vision and graphics to biology and astronomy. In this paper, we explore the differences between common conditioning mechanisms within these networks, an essential ingredient in shifting neural fields from memorization of signals to generalization, where the set of signals lying on a manifold is modelled jointly. In particular, we are interested in the scaling behaviour of these mechanisms to increasingly high-dimensional conditioning variables. As we show in our experiments, high-dimensional conditioning is key to modelling complex data distributions, thus it is important to determine what architecture choices best enable this when working on such problems. To this end, we run experiments modelling 2D, 3D, and 4D signals with neural fields, employing concatenation, hyper-network, and attention-based conditioning strategies -- a necessary but laborious effort that has not been performed in the literature.
We find that attention-based conditioning outperforms other approaches in a variety of settings.

URL: https://openreview.net/forum?id=GzqdMrFQsE
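
To make the two ends of the comparison concrete, a minimal PyTorch sketch of a concatenation-conditioned coordinate MLP versus a cross-attention-conditioned one. Dimensions, layer counts, and class names are illustrative assumptions, not the architectures evaluated in the paper.

import torch
import torch.nn as nn

class ConcatField(nn.Module):
    """Concatenation conditioning: the latent code z is appended to every coordinate."""
    def __init__(self, coord_dim=2, z_dim=256, hidden=256, out_dim=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(coord_dim + z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim))

    def forward(self, coords, z):                  # coords: (B, N, coord_dim), z: (B, z_dim)
        z = z.unsqueeze(1).expand(-1, coords.shape[1], -1)
        return self.net(torch.cat([coords, z], dim=-1))

class AttentionField(nn.Module):
    """Attention conditioning: coordinate queries attend over a set of latent tokens."""
    def __init__(self, coord_dim=2, token_dim=256, out_dim=3):
        super().__init__()
        self.embed = nn.Linear(coord_dim, token_dim)
        self.attn = nn.MultiheadAttention(token_dim, num_heads=4, batch_first=True)
        self.out = nn.Linear(token_dim, out_dim)

    def forward(self, coords, tokens):             # tokens: (B, n_tokens, token_dim)
        q = self.embed(coords)
        h, _ = self.attn(q, tokens, tokens)
        return self.out(h)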

---

Title: Stochastic Douglas-Rachford Splitting for Regularized Empirical Risk Minimization: Convergence, Mini-batch, and Implementation

Abstract: In this paper, we study the stochastic Douglas-Rachford splitting (SDRS) for general empirical risk minimization (ERM) problems with regularization. Our first contribution is to close the theoretical gap by proving its convergence for both convex and strongly convex problems; the convergence rates are $O(1/\sqrt{t})$ and $O(1/t)$, respectively. Since SDRS reduces to the stochastic proximal point algorithm (SPPA) when there is no regularization, it is pleasing to see that the result matches that of SPPA, under the same mild conditions. We also propose the mini-batch version of SDRS, which handles multiple samples simultaneously while maintaining the same efficiency as processing a single sample, which is not a straightforward extension in the context of stochastic proximal algorithms. We show that the mini-batch SDRS again enjoys the same convergence rate. Furthermore, we demonstrate that, for some of the canonical regularized ERM problems, each iteration of SDRS can be efficiently calculated either in closed form or in close to closed form via bisection---the resulting complexity is identical to that of, for example, the stochastic (sub)gradient method. Experiments on real data demonstrate its effectiveness in terms of convergence compared to SGD and its variants.

URL: https://openreview.net/forum?id=uvDD9rN6Zz
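
A minimal sketch of the kind of iteration involved, specialised to l1-regularized least squares, where both proximal operators have closed forms. The fixed parameter gamma and the naive single-sample updates are simplifications; the paper's algorithm, step-size choices, and mini-batch construction differ.

import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def sdrs_lasso(A, b, lam, gamma=1.0, n_iter=2000, seed=0):
    """Illustrative stochastic Douglas-Rachford sketch for
    min_x (1/n) sum_i 0.5*(a_i^T x - b_i)^2 + lam*||x||_1."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    z = np.zeros(d)
    for _ in range(n_iter):
        x = soft_threshold(z, gamma * lam)                      # prox of the l1 regulariser
        i = rng.integers(n)
        a, bi = A[i], b[i]
        v = 2 * x - z
        # Closed-form prox of the single-sample squared loss 0.5*(a^T x - b_i)^2
        y = v - gamma * (a @ v - bi) / (1 + gamma * (a @ a)) * a
        z = z + y - x
    return soft_threshold(z, gamma * lam)

# Example usage (toy data): x_hat = sdrs_lasso(rng_data_A, rng_data_b, lam=0.1)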

---

Title: Adjusting Machine Learning Decisions for Equal Opportunity and Counterfactual Fairness

Abstract: Machine learning (ML) methods have the potential to automate high-stakes decisions, such as bail admissions or credit lending, by analyzing and learning from historical data. But these algorithmic decisions may be unfair: in learning from historical data, they may replicate discriminatory practices from the past. In this paper, we propose two algorithms that adjust fitted ML predictors to produce decisions that are fair. Our methods provide post-hoc adjustments to the predictors, without requiring that they be retrained. We consider a causal model of the ML decisions, define fairness through counterfactual decisions within the model, and then form algorithmic decisions that capture the historical data as well as possible but are provably fair. In particular, we consider two definitions of fairness. The first is ``equal counterfactual opportunity,'' where the counterfactual distribution of the decision is the same regardless of the protected attribute; the second is counterfactual fairness. We evaluate the algorithms, and the trade-off between accuracy and fairness, on datasets about admissions, income, credit, and recidivism.

URL: https://openreview.net/forum?id=P6NcRPb13w

---

Title: Modeling Bounded Rationality in Multi-Agent Simulations Using Rationally Inattentive Reinforcement Learning

Abstract: Multi-agent reinforcement learning (MARL) is a powerful framework for studying emergent behavior in complex agent-based simulations. However, RL agents are often assumed to be rational and behave optimally, which does not fully reflect human behavior. In this work, we propose a new, more human-like RL agent, which incorporates an established model of human irrationality, the Rational Inattention (RI) model. RI models the cost of cognitive information processing using mutual information. Our RIRL framework generalizes and is more flexible than prior work by allowing for multi-timestep dynamics and information channels with heterogeneous processing costs. We demonstrate the flexibility of RIRL in versions of a classic economic setting (the Principal-Agent problem) with varying complexity. In simple settings, we show that RIRL yields agent behavior policies of approximately the same functional form as those derived through theoretical analysis in prior work. We additionally demonstrate that using RIRL to analyze complex, theoretically intractable settings yields a rich spectrum of new equilibrium behaviors that differ from those found under rational assumptions. For example, increasing the cognitive cost experienced by a manager agent results in the other agents increasing the magnitude of their actions to compensate. These results suggest RIRL is a powerful tool toward building AI agents that can mimic real human behavior.

URL: https://openreview.net/forum?id=DY1pMrmDkm
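
The mutual-information-style cost at the heart of rational inattention can be illustrated in a few lines. This is a toy sketch; how the cost enters the RIRL objective and its multi-channel generalization is more involved than this.

import numpy as np

def information_cost(policy_probs, marginal_probs):
    """KL(pi(.|s) || pi_marginal): a per-state information-processing cost in the spirit
    of rational inattention (probabilities assumed strictly positive)."""
    p, q = np.asarray(policy_probs, float), np.asarray(marginal_probs, float)
    return float(np.sum(p * np.log(p / q)))

pi_s = [0.8, 0.15, 0.05]           # attentive, state-specific policy
pi_marginal = [1 / 3, 1 / 3, 1 / 3]  # default / average behavior
beta = 0.1                          # cognitive cost weight (illustrative)
# shaped_reward = env_reward - beta * information_cost(pi_s, pi_marginal)
print(information_cost(pi_s, pi_marginal))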

---

Title: MVSFormer: Multi-View Stereo by Learning Robust Image Features and Temperature-based Depth

Abstract: Feature representation learning is the key recipe for learning-based Multi-View Stereo (MVS). As the common feature extractor of learning-based MVS, vanilla Feature Pyramid Networks (FPNs) produce weak feature representations for reflective and texture-less areas, which limits the generalization of MVS. Even FPNs combined with pre-trained Convolutional Neural Networks (CNNs) fail to tackle these issues. On the other hand, Vision Transformers (ViTs) have achieved prominent success in many 2D vision tasks. Thus, we ask whether ViTs can facilitate feature learning in MVS. In this paper, we propose a pre-trained ViT enhanced MVS network called MVSFormer, which learns more reliable feature representations by benefiting from the informative priors of the ViT. MVSFormer-P and MVSFormer-H are further proposed, with frozen and trainable ViT weights, respectively. MVSFormer-P is more efficient, while MVSFormer-H can achieve superior performance. MVSFormer can be generalized to various input resolutions through efficient multi-scale training strengthened by gradient accumulation. Moreover, we discuss the merits and drawbacks of classification- and regression-based MVS methods, and further propose to unify them with a temperature-based strategy. MVSFormer achieves state-of-the-art performance on the DTU dataset. In particular, our anonymous Tanks-and-Temples submission of MVSFormer was made in May 2022 and, as of the day this paper was submitted to TMLR, remains ranked Top-1 on both the intermediate and advanced sets of the highly competitive Tanks-and-Temples leaderboard. Codes and models will be released upon acceptance.

URL: https://openreview.net/forum?id=2VWR6JfwNo

---

Title: Revisiting adversarial training for the worst-performing class

Abstract: Despite progress in adversarial training (AT), there is a substantial gap between top-performing and worst-performing classes, e.g. on CIFAR10 the accuracies for the best and the worst class are 74% and 23%, respectively. We argue that the gap can be reduced by explicitly optimizing for the worst-performing class, which results in a min-max-max optimization formulation. We provide high-probability convergence guarantees on the worst class loss for our method, called class focused online learning (CFOL), which can be plugged into existing training setups with virtually no overhead in computation. We observe a significant improvement in the worst class accuracy of 32% for CIFAR10 and verify a consistent behavior across CIFAR100 and STL10. Our study sheds light on moving beyond average accuracy, which can be beneficial in safety-critical applications.

URL: https://openreview.net/forum?id=wkecshlYxI
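
The min-max-max idea can be approximated with an adversarial distribution over classes that is updated online. The sketch below uses exponentiated-gradient updates and only illustrates the general mechanism; it is not the exact CFOL procedure.

import numpy as np

class ClassSampler:
    """Hypothetical class-focused sampler: maintains an adversarial distribution over
    classes and upweights classes that currently suffer high loss."""
    def __init__(self, n_classes, eta=0.1):
        self.w = np.ones(n_classes)
        self.eta = eta

    def probs(self):
        return self.w / self.w.sum()

    def sample(self, rng):
        return rng.choice(len(self.w), p=self.probs())

    def update(self, cls, loss):
        # Exponentiated-gradient step on the observed class loss
        self.w[cls] *= np.exp(self.eta * loss)

# Training sketch: draw the next class from sampler.sample(rng), train on (adversarial)
# examples of that class, then call sampler.update(cls, observed_class_loss).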

---

Title: CrossMAE: Disentangled Masked Image Modeling with Cross Prediction

Abstract: We present CrossMAE, a novel and flexible masked image modeling (MIM) approach, which progressively disentangles the existing mask-then-predict objective in MIM into mask-then-draw and draw-then-predict. During the mask-then-draw phase, an auxiliary drawing head models the uncertainty and produces coarse and diverse outputs. Subsequently, in draw-then-predict, the backbone receives the completions and strives to predict versatile signals from them. These two disentangled objectives are end-to-end trainable and involved in a single pass, separating low-level generation from high-level understanding. Through extensive experiments and compelling results on a variety of tasks, including image classification, semantic segmentation, object detection, instance segmentation, and even facial landmark detection, we demonstrate that the proposed pre-training scheme learns generalizable features effectively. Beyond surpassing existing MIM counterparts, CrossMAE exhibits better data efficiency in both pre-training and fine-tuning.

URL: https://openreview.net/forum?id=wG8KcY3fuX

---

Title: Regularising for invariance to data augmentation improves supervised learning

Abstract: Data augmentation is used in machine learning to make the classifier invariant to label-preserving transformations. Usually this invariance is only encouraged implicitly, by sampling a single augmentation per image and training epoch. However, several works have recently shown that using multiple augmentations per input can improve generalisation or can be used to incorporate invariances more explicitly. In this work, we first empirically compare these recently proposed objectives, which differ in whether they rely on explicit or implicit regularisation and at what level of the predictor they encode the invariances. We show that the predictions of the best performing method are also the most similar when compared on different augmentations of the same input. Inspired by this observation, we propose an explicit regulariser that encourages this invariance at the level of individual model predictions. Through extensive experiments on CIFAR-100 and ImageNet we show that this explicit regulariser (i) improves generalisation and (ii) equalises performance differences between all considered objectives. Our results suggest that objectives that encourage invariance at the level of the neural network features themselves generalise better than those that only achieve invariance by averaging predictions of non-invariant models.

URL: https://openreview.net/forum?id=1LXRiPYSC4
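
One simple instantiation of a prediction-level invariance regulariser, assuming two augmentations per input; the paper's regulariser and the levels at which invariance is imposed may differ.

import torch
import torch.nn.functional as F

def invariance_penalty(logits_aug1, logits_aug2):
    """Symmetric KL between the predictions on two augmentations of the same inputs;
    one simple way to encourage prediction-level invariance (illustrative sketch)."""
    p = F.log_softmax(logits_aug1, dim=-1)
    q = F.log_softmax(logits_aug2, dim=-1)
    kl_pq = F.kl_div(q, p, log_target=True, reduction="batchmean")
    kl_qp = F.kl_div(p, q, log_target=True, reduction="batchmean")
    return 0.5 * (kl_pq + kl_qp)

# total_loss = F.cross_entropy(logits_aug1, labels) + lam * invariance_penalty(logits_aug1, logits_aug2)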

---

Title: Continual few-shot learning with Hippocampal-inspired replay

Abstract: Continual learning and few-shot learning are important frontiers in the quest to improve Machine Learning. There is a growing body of work in each frontier, but very little combining the two. Recently, however, Antoniou et al. (2020) introduced a Continual Few-shot Learning framework, CFSL, that combines both. In this study, we extended CFSL to make it more comparable to standard continual learning experiments, where usually a much larger number of classes is presented. We also introduced an `instance test' to classify very similar specific instances - a capability of animal cognition that is usually neglected in ML. We selected representative baseline models from the original CFSL work and compared them to a model with Hippocampal-inspired replay, as the Hippocampus is considered vital to this type of learning in animals. As expected, learning more classes is more difficult than in the original CFSL experiments, and interestingly, the way in which they are presented makes a difference to performance. Accuracy in the instance test is comparable to that in the classification tasks. The use of replay for consolidation improves performance substantially for both types of task, particularly the instance test.

URL: https://openreview.net/forum?id=dsGJo0GuF5

---

Title: FedDAG: Federated DAG Structure Learning

Abstract: To date, most directed acyclic graph (DAG) structure learning approaches require data to be stored on a central server. However, out of privacy concerns, data owners increasingly refuse to share their raw data in order to avoid information leakage, which makes this task more troublesome by cutting off its first step. A puzzle thus arises: how do we discover the underlying DAG structure from decentralized data? In this paper, focusing on the additive noise models (ANMs) assumption of data generation, we take the first step in developing a gradient-based learning framework named FedDAG, which can learn the DAG structure without directly touching the local data and also naturally handles data heterogeneity. Our method benefits from a two-level structure of each local model. The first level learns the edges and directions of the graph and communicates with the server to obtain model information from other clients during the learning procedure, while the second level approximates the mechanisms between variables and is updated locally on its own data to accommodate the data heterogeneity. Moreover, FedDAG formulates the overall learning task as a continuous optimization problem by taking advantage of an equality acyclicity constraint, which can be solved by gradient descent methods to boost the search efficiency. Extensive experiments on both synthetic and real-world datasets verify the efficacy of the proposed method.

URL: https://openreview.net/forum?id=MzWgBjZ6Le
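
Continuous acyclicity constraints of this kind are commonly written in the NOTEARS form sketched below; whether FedDAG uses exactly this formulation is an assumption here.

import numpy as np
from scipy.linalg import expm

def acyclicity(W):
    """NOTEARS-style equality constraint: h(W) = tr(exp(W*W)) - d,
    which equals zero iff the weighted adjacency matrix W encodes a DAG."""
    d = W.shape[0]
    return np.trace(expm(W * W)) - d

W_dag = np.array([[0.0, 1.2, 0.0],
                  [0.0, 0.0, 0.7],
                  [0.0, 0.0, 0.0]])
W_cyc = W_dag + np.array([[0.0, 0.0, 0.0],
                          [0.5, 0.0, 0.0],
                          [0.0, 0.0, 0.0]])
print(acyclicity(W_dag))   # ~0: acyclic
print(acyclicity(W_cyc))   # > 0: contains a cycle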

---

Title: Controllable Generative Modeling via Causal Reasoning

Abstract: Deep latent variable generative models excel at generating complex, high-dimensional data, often exhibiting impressive generalization beyond the training distribution. However, many such models in use today are black-boxes trained on large unlabelled datasets with statistical objectives and lack an interpretable understanding of the latent space required for controlling the generative process.
We propose CAGE, a framework for controllable generation in latent variable models based on causal reasoning.
Given a pair of attributes, CAGE infers the implicit cause-effect relationships between these attributes as induced by a deep generative model. This is achieved by defining and estimating a novel notion of unit-level causal effects in the latent space of the generative model.
Thereafter, we use the inferred cause-effect relationships to design a novel strategy for controllable generation based on counterfactual sampling. Through a series of large-scale synthetic and human evaluations, we demonstrate that generating counterfactual samples which respect the underlying causal relationships inferred via CAGE leads to subjectively more realistic images.

URL: https://openreview.net/forum?id=Z44YAcLaGw

---

Title: Named Tensor Notation

Abstract: We propose a notation for tensors with named axes, which relieves the author, reader, and future implementers of machine learning models from the burden of keeping track of the order of axes and the purpose of each. The notation makes it easy to lift operations on low-order tensors to higher order ones, for example, from images to minibatches of images, or from an attention mechanism to multiple attention heads.

After a brief overview and formal definition of the notation, we illustrate it through several examples from modern machine learning, from building blocks like attention and convolution to full models like Transformers and LeNet. We then discuss differential calculus in our notation and compare with some alternative notations. Our proposals build on ideas from many previous papers and software libraries. We hope that this document will encourage more authors to use named tensors, resulting in clearer papers and more precise implementations.

URL: https://openreview.net/forum?id=hVT7SHlilx
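
PyTorch's (experimental) named tensor API gives a feel for the kind of clarity the notation is after, though the paper's proposal is a mathematical notation rather than a library feature; the names and shapes below are illustrative.

import torch

# Unnamed tensors: the reader must remember which axis is which
x = torch.randn(32, 3, 64, 64)            # batch, channel, height, width -- by convention only

# Named tensors make the intent explicit, echoing the spirit of named tensor notation
x = torch.randn(32, 3, 64, 64, names=("batch", "channel", "height", "width"))
pooled = x.sum("channel")                  # reduce over an axis by name, not by position
print(pooled.names)                        # ('batch', 'height', 'width')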

---
