Weekly TMLR digest for Sep 18, 2022


TMLR

Sep 17, 2022, 8:00:09 PM
to tmlr-annou...@googlegroups.com

Accepted papers
===============


Title: Do better ImageNet classifiers assess perceptual similarity better?

Authors: Manoj Kumar, Neil Houlsby, Nal Kalchbrenner, Ekin Dogus Cubuk

Abstract: Perceptual distances between images, as measured in the space of pre-trained deep features, have outperformed prior low-level, pixel-based metrics on assessing image similarity. While the capabilities of older and less accurate models such as AlexNet and VGG to capture perceptual similarity are well known, modern and more accurate models are less studied. In this paper, we present a large-scale empirical study to assess how well ImageNet classifiers perform on perceptual similarity. First, we observe an inverse correlation between ImageNet accuracy and Perceptual Scores of modern networks such as ResNets, EfficientNets, and Vision Transformers: that is, better classifiers achieve worse Perceptual Scores. Then, we examine the ImageNet accuracy/Perceptual Score relationship while varying the depth, width, number of training steps, weight decay, label smoothing, and dropout. Higher accuracy improves Perceptual Score up to a certain point, but we uncover a Pareto frontier between accuracies and Perceptual Score in the mid-to-high accuracy regime. We explore this relationship further using a number of plausible hypotheses such as distortion invariance, spatial frequency sensitivity, and alternative perceptual functions. Interestingly, we discover shallow ResNets and ResNets trained for fewer than 5 epochs only on ImageNet, whose emergent Perceptual Score matches the prior best networks trained directly on supervised human perceptual judgements.
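
As a rough illustration of the kind of feature-space metric studied here (not the authors' code), the sketch below computes a perceptual distance as a normalized squared difference of deep features; the tiny random CNN is only a stand-in for a pretrained ImageNet backbone.

```python
# Minimal sketch of a feature-space perceptual distance. The tiny random CNN
# stands in for a pretrained ImageNet backbone (e.g. an AlexNet/VGG/ResNet
# feature extractor); the normalize-then-average recipe is the common one.
import torch
import torch.nn as nn
import torch.nn.functional as F

backbone = nn.Sequential(          # stand-in feature extractor
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
)

def perceptual_distance(x0, x1, extractor=backbone):
    """Distance between unit-normalized deep features of two image batches."""
    f0, f1 = extractor(x0), extractor(x1)
    f0 = F.normalize(f0, dim=1)            # unit-normalize each channel vector
    f1 = F.normalize(f1, dim=1)
    return ((f0 - f1) ** 2).mean(dim=(1, 2, 3))   # spatially averaged squared diff

img_a = torch.rand(4, 3, 64, 64)
img_b = img_a + 0.1 * torch.randn_like(img_a)      # slightly distorted copies
print(perceptual_distance(img_a, img_b))
```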

URL: https://openreview.net/forum?id=qrGKGZZvH0

---

Title: Learning Accurate Decision Trees with Bandit Feedback via Quantized Gradient Descent

Authors: Ajaykrishna Karthikeyan, Naman Jain, Nagarajan Natarajan, Prateek Jain

Abstract: Decision trees provide a rich family of highly non-linear but efficient models, which is why they remain the go-to family of predictive models for practitioners across domains. But learning trees is challenging due to their discrete decision boundaries. The state-of-the-art (SOTA) techniques resort to (a) learning soft trees, thereby losing logarithmic inference time; or (b) using methods tailored to specific supervised learning settings, requiring access to labeled examples and a loss function. In this work, by leveraging techniques like overparameterization and straight-through estimators, we propose a unified method that enables accurate end-to-end gradient-based tree training and can be deployed in a variety of settings like offline supervised learning and online learning with bandit feedback. Using extensive validation on standard benchmarks, we demonstrate that our method provides the best of both worlds, i.e., it is competitive with, and in some cases more accurate than, methods designed specifically for the supervised setting; and in bandit settings, where most existing tree learning techniques are not applicable, our models are still accurate and significantly outperform the applicable SOTA methods.
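
A hedged sketch of the straight-through trick mentioned in the abstract, applied to a single hard routing decision; the overparameterized tree and the bandit-feedback training loop are left out here.

```python
# Minimal sketch of a straight-through estimator for one hard left/right
# routing decision of a decision node. The full overparameterized tree and the
# bandit-feedback objective from the paper are not shown.
import torch

def hard_route(logit):
    """Forward pass uses the hard 0/1 decision; backward pass uses the
    gradient of the soft sigmoid (straight-through estimator)."""
    soft = torch.sigmoid(logit)
    hard = (soft > 0.5).float()
    return hard + (soft - soft.detach())   # value == hard, gradient == d(soft)

w = torch.randn(5, requires_grad=True)     # weights of one decision node
x = torch.randn(8, 5)                      # a batch of inputs
decision = hard_route(x @ w)               # hard routing, still differentiable
loss = decision.mean()
loss.backward()
print(decision, w.grad)
```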

URL: https://openreview.net/forum?id=u0n1chY0b6

---

Title: Probabilistic Autoencoder

Authors: Vanessa M Boehm, Uros Seljak

Abstract: Principal Component Analysis (PCA) minimizes the reconstruction error given a class of linear models of fixed component dimensionality. Probabilistic PCA adds a probabilistic structure by learning the probability distribution of the PCA latent space weights, thus creating a generative model. Autoencoders (AE) minimize the reconstruction error in a class of nonlinear models of fixed latent space dimensionality and outperform PCA at fixed dimensionality. Here, we introduce the Probabilistic Autoencoder (PAE) that learns the probability distribution of the AE latent space weights using a normalizing flow (NF). The PAE is fast and easy to train and achieves small reconstruction errors, high sample quality, and good performance in downstream tasks. We compare the PAE to Variational AE (VAE), showing that the PAE trains faster, reaches a lower reconstruction error, and produces good sample quality without requiring special tuning parameters or training procedures.
We further demonstrate that the PAE is a powerful model for performing the downstream tasks of probabilistic image reconstruction in the context of Bayesian inference of inverse problems for inpainting and denoising applications. Finally, we identify latent space density from NF as a promising outlier detection metric.
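
A minimal sketch of the two-stage recipe described above, with a diagonal Gaussian standing in for the normalizing flow purely to keep the example short; all layer sizes are illustrative assumptions.

```python
# Minimal sketch of the Probabilistic Autoencoder recipe: train a plain
# autoencoder, then fit a density model to its latent codes and sample by
# decoding draws from that density. A diagonal Gaussian stands in here for the
# normalizing flow used in the paper.
import torch
import torch.nn as nn

enc = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 16))
dec = nn.Sequential(nn.Linear(16, 128), nn.ReLU(), nn.Linear(128, 784))
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

data = torch.rand(256, 784)                       # toy data in place of images
for _ in range(100):                              # stage 1: ordinary AE training
    recon = dec(enc(data))
    loss = ((recon - data) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():                             # stage 2: fit latent density
    z = enc(data)
    mu, sigma = z.mean(0), z.std(0)               # Gaussian stand-in for the NF

samples = dec(mu + sigma * torch.randn(10, 16))   # generate by decoding draws
print(samples.shape)
```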

URL: https://openreview.net/forum?id=AEoYjvjKVA

---

Title: Decoder Denoising Pretraining for Semantic Segmentation

Authors: Emmanuel Asiedu Brempong, Simon Kornblith, Ting Chen, Niki Parmar, Matthias Minderer, Mohammad Norouzi

Abstract: Semantic segmentation labels are expensive and time-consuming to acquire. Hence, pretraining is commonly used to improve the label-efficiency of segmentation models. Typically, the encoder of a segmentation model is pretrained as a classifier and the decoder is randomly initialized. Here, we argue that random initialization of the decoder can be suboptimal, especially when few labeled examples are available. We propose a decoder pretraining approach based on denoising, which can be combined with supervised pretraining of the encoder. We find that decoder denoising pretraining on the ImageNet dataset strongly outperforms encoder-only supervised pretraining. Despite its simplicity, decoder denoising pretraining achieves state-of-the-art results on label-efficient semantic segmentation and offers considerable gains on the Cityscapes, Pascal Context, and ADE20K datasets.
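
A minimal sketch of decoder denoising pretraining under simplifying assumptions: the untrained encoder below stands in for a supervised ImageNet-pretrained encoder, and only the decoder is updated with a denoising reconstruction loss.

```python
# Minimal sketch of decoder denoising pretraining: the encoder is taken as-is
# (here an untrained stand-in for a supervised ImageNet-pretrained encoder) and
# the decoder is pretrained to reconstruct the clean image from a noised input.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU())
decoder = nn.Sequential(nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1))
opt = torch.optim.Adam(decoder.parameters(), lr=1e-3)  # pretrain the decoder only

images = torch.rand(8, 3, 64, 64)
for _ in range(50):
    noisy = images + 0.3 * torch.randn_like(images)    # additive Gaussian noise
    recon = decoder(encoder(noisy))
    loss = ((recon - images) ** 2).mean()              # denoising objective
    opt.zero_grad(); loss.backward(); opt.step()

# After pretraining, encoder + decoder are fine-tuned on the labeled
# segmentation data with a standard per-pixel loss.
```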

URL: https://openreview.net/forum?id=D3WI0QG7dC

---

Title: Can You Win Everything with A Lottery Ticket?

Authors: Tianlong Chen, Zhenyu Zhang, Jun Wu, Randy Huang, Sijia Liu, Shiyu Chang, Zhangyang Wang

Abstract: The $\textit{lottery ticket hypothesis}$ (LTH) has been demonstrated to yield independently trainable and highly sparse neural networks (a.k.a. $\textit{winning tickets}$), whose test set accuracies can be surprisingly on par with or even better than those of dense models. However, accuracy is far from the only evaluation metric, and perhaps not always the most important one. Hence it might be myopic to conclude that a sparse subnetwork can replace its dense counterpart, even if the accuracy is preserved. Spurred by that, we perform the first comprehensive assessment of lottery tickets from diverse aspects beyond test accuracy, including $\textit{(i)}$ generalization to distribution shifts, $\textit{(ii)}$ prediction uncertainty, $\textit{(iii)}$ interpretability, and $\textit{(iv)}$ geometry of loss landscapes. With extensive experiments across datasets (CIFAR-10, CIFAR-100, and ImageNet), model architectures, as well as tens of sparsification methods, we thoroughly characterize the trade-off between model sparsity and model capabilities along all of these dimensions. We find that an appropriate sparsity (e.g., $20\%\sim99.08\%$) can yield a winning ticket that performs comparably or even better $\textbf{in all above four aspects}$, although some aspects (generalization to certain distribution shifts, and uncertainty) appear more sensitive to sparsification than others. We term this an $\texttt{LTH-PASS}$. Overall, our results endorse choosing a good sparse subnetwork of a larger dense model over directly training a small dense model of similar parameter count. We hope that our study can offer more in-depth insights into pruning for researchers and engineers who seek to incorporate sparse neural networks in user-facing deployments. Code is available at: https://github.com/VITA-Group/LTH-Pass.

URL: https://openreview.net/forum?id=JL6MU9XFzW

---

Title: HEAT: Hyperedge Attention Networks

Authors: Dobrik Georgiev Georgiev, Marc Brockschmidt, Miltiadis Allamanis

Abstract: Learning from structured data is a core machine learning task. Commonly, such data is represented as graphs, which normally only consider (typed) binary relationships between pairs of nodes. This is a substantial limitation for many domains with highly-structured data. One important such domain is source code, where hypergraph-based representations can better capture the semantically rich and structured nature of code.
In this work, we present HEAT, a neural model capable of representing typed and qualified hypergraphs, where each hyperedge explicitly qualifies how participating nodes contribute. It can be viewed as a generalization of both message passing neural networks and Transformers. We evaluate HEAT on knowledge base completion and on bug detection and repair using a novel hypergraph representation of programs. In both settings, it outperforms strong baselines, indicating its power and generality.

URL: https://openreview.net/forum?id=gCmQK6McbR

---

Title: On the Near-Optimality of Local Policies in Large Cooperative Multi-Agent Reinforcement Learning

Authors: Washim Uddin Mondal, Vaneet Aggarwal, Satish Ukkusuri

Abstract: We show that in a cooperative $N$-agent network, one can design locally executable policies for the agents such that the resulting discounted sum of average rewards (value) well approximates the optimal value computed over all (including non-local) policies. Specifically, we prove that, if $|\mathcal{X}|$ and $|\mathcal{U}|$ denote the sizes of the state and action spaces of individual agents, then for a sufficiently small discount factor, the approximation error is given by $\mathcal{O}(e)$ where $e\triangleq \frac{1}{\sqrt{N}}\left[\sqrt{|\mathcal{X}|}+\sqrt{|\mathcal{U}|}\right]$. Moreover, in the special case where the reward and state transition functions are independent of the action distribution of the population, the error improves to $\mathcal{O}(e)$ where $e\triangleq \frac{1}{\sqrt{N}}\sqrt{|\mathcal{X}|}$. Finally, we also devise an algorithm to explicitly construct a local policy. With the help of our approximation results, we further establish that the constructed local policy is within $\mathcal{O}(\max\{e,\epsilon\})$ distance of the optimal policy, and that the sample complexity to achieve such a local policy is $\mathcal{O}(\epsilon^{-3})$, for any $\epsilon>0$.

URL: https://openreview.net/forum?id=t5HkgbxZp1

---

Title: Weight Expansion: A New Perspective on Dropout and Generalization

Authors: Gaojie Jin, Xinping Yi, Pengfei Yang, Lijun Zhang, Sven Schewe, Xiaowei Huang

Abstract: While dropout is known to be a successful regularization technique, insights into the mechanisms that lead to this success are still lacking. We introduce the concept of weight expansion, an increase in the signed volume of a parallelotope spanned by the column or row vectors of the weight covariance matrix, and show that weight expansion is an effective means of increasing the generalization in a PAC-Bayesian setting. We provide a theoretical argument that dropout leads to weight expansion and extensive empirical support for the correlation between dropout and weight expansion. To support our hypothesis that weight expansion can be regarded as an indicator of the enhanced generalization capability endowed by dropout, and not just as a mere by-product, we have studied other methods that achieve weight expansion (resp.\ contraction), and found that they generally lead to an increased (resp.\ decreased) generalization ability. This suggests that dropout is an attractive regularizer, because it is a computationally cheap method for obtaining weight expansion. This insight justifies the role of dropout as a regularizer, while paving the way for identifying regularizers that promise improved generalization through weight expansion.
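
As a hedged illustration, the snippet below computes one plausible reading of the weight-expansion quantity, the (log) volume spanned by the weight covariance matrix, for a stand-in weight matrix; the paper's exact estimator may differ.

```python
# Sketch of the "weight expansion" quantity described in the abstract: the
# signed volume of the parallelotope spanned by the columns of the weight
# covariance matrix is its determinant, computed here in log-space for
# numerical stability. W stands in for one layer's weights.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 128))              # stand-in weight matrix

C = np.cov(W, rowvar=False)                  # covariance of the weight columns
sign, logvol = np.linalg.slogdet(C)          # log |det C| and its sign
print("sign:", sign, "log-volume:", logvol)

# Comparing this quantity before/after training with and without dropout is
# one way to probe the claimed link between dropout and weight expansion.
```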

URL: https://openreview.net/forum?id=w3z3sN1b04

---

Title: Exploring Efficient Few-shot Adaptation for Vision Transformers

Authors: Chengming Xu, Siqian Yang, Yabiao Wang, Zhanxiong Wang, Yanwei Fu, Xiangyang Xue

Abstract: The task of Few-shot Learning (FSL) aims to perform inference on novel categories containing only a few labeled examples, with the help of knowledge learned from base categories containing abundant labeled training samples. While there are numerous works on the FSL task, Vision Transformers (ViTs) have rarely been taken as the backbone for FSL, with the few existing attempts focusing on naive finetuning of the whole backbone or the classification layer. Essentially, although ViTs have been shown to enjoy comparable or even better performance on other vision tasks, it is still very nontrivial to efficiently finetune ViTs in real-world FSL scenarios. To this end, we propose a novel efficient Transformer Tuning (eTT) method that facilitates finetuning ViTs in FSL tasks. The key novelties come from the newly presented Attentive Prefix Tuning (APT) and Domain Residual Adapter (DRA) for task and backbone finetuning, respectively. Specifically, in APT, the prefix is projected to new key and value pairs that are attached to each self-attention layer to provide the model with task-specific information. Moreover, we design the DRA in the form of learnable offset vectors to handle the potential domain gaps between base and novel data. To ensure that the APT does not deviate much from the initial task-specific information, we further propose a novel prototypical regularization, which minimizes the similarity between the projected distribution of the prefix and the initial prototypes, regularizing the update procedure. Our method achieves outstanding performance on the challenging Meta-Dataset. We conduct extensive experiments to show the efficacy of our model. Our model and code will be released.
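
A hedged sketch of the prefix mechanism behind APT: learned prefix vectors are projected to extra key/value pairs that queries can attend to. The attentive prefix construction, the DRA offsets and the prototypical regularizer are not reproduced, and all dimensions are illustrative.

```python
# Minimal sketch of prefix tuning inside one self-attention layer: a small set
# of learned prefix vectors is projected to extra key/value pairs, while the
# frozen backbone weights stay untouched. The paper's attentive prefix, DRA
# offsets and prototypical regularizer are not shown.
import torch
import torch.nn as nn
import torch.nn.functional as F

d, n_prefix = 64, 8
prefix = nn.Parameter(torch.randn(n_prefix, d))      # task-specific, trainable
to_kv = nn.Linear(d, 2 * d)                          # projects prefix to (K, V)

def attention_with_prefix(q, k, v):
    pk, pv = to_kv(prefix).chunk(2, dim=-1)          # prefix keys / values
    k = torch.cat([pk.expand(q.size(0), -1, -1), k], dim=1)
    v = torch.cat([pv.expand(q.size(0), -1, -1), v], dim=1)
    attn = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
    return attn @ v

q = k = v = torch.randn(2, 16, d)                    # (batch, tokens, dim)
print(attention_with_prefix(q, k, v).shape)          # -> (2, 16, 64)
```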

URL: https://openreview.net/forum?id=n3qLz4eL1l

---

Title: Exploring the Learning Mechanisms of Neural Division Modules

Authors: Bhumika Mistry, Katayoun Farrahi, Jonathon Hare

Abstract: Of the four fundamental arithmetic operations (+, -, $\times$, $\div$), division is considered the most difficult for both humans and computers.
In this paper, we show that robustly learning division in a systematic manner remains a challenge even at the simplest level of dividing two numbers. We propose two novel approaches for division which we call the Neural Reciprocal Unit (NRU) and the Neural Multiplicative Reciprocal Unit (NMRU), and present improvements for an existing division module, the Real Neural Power Unit (Real NPU). In total we measure robustness over 475 different training sets for setups with and without input redundancy. We discover robustness is greatly affected by the input sign for the Real NPU and NRU, input magnitude for the NMRU and input distribution for every module. Despite this issue, we show that the modules can learn as part of larger end-to-end networks.

URL: https://openreview.net/forum?id=HjelcW6wio

---

Title: Domain Invariant Adversarial Learning

Authors: Matan Levi, Idan Attias, Aryeh Kontorovich

Abstract: The phenomenon of adversarial examples illustrates one of the most basic vulnerabilities of deep neural networks. Among the variety of techniques introduced to surmount this inherent weakness, adversarial training has emerged as the most effective strategy for learning robust models. Typically, this is achieved by balancing robust and natural objectives. In this work, we aim to further optimize the trade-off between robust and standard accuracy by enforcing a domain-invariant feature representation. We present a new adversarial training method, Domain Invariant Adversarial Learning (DIAL), which learns a feature representation that is both robust and domain invariant. DIAL uses a variant of Domain Adversarial Neural Network (DANN) on the natural domain and its corresponding adversarial domain. In the case where the source domain consists of natural examples and the target domain is the adversarially perturbed examples, our method learns a feature representation constrained not to discriminate between the natural and adversarial examples, and can therefore achieve a more robust representation. DIAL is a generic and modular technique that can be easily incorporated into any adversarial training method. Our experiments indicate that incorporating DIAL in the adversarial training process improves both robustness and standard accuracy.
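
A rough sketch of the DIAL idea under simplifying assumptions: adversarial examples (FGSM here, only for brevity) form a second domain, and a domain classifier trained through a gradient reversal layer pushes features of the two domains together. The paper's exact DANN variant and training schedule are not shown.

```python
# Sketch: features of natural and adversarially perturbed inputs are pushed to
# be indistinguishable by a domain classifier trained through gradient reversal.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x
    @staticmethod
    def backward(ctx, grad):
        return -grad                       # flip the gradient sign

features = nn.Sequential(nn.Linear(20, 64), nn.ReLU())
classifier = nn.Linear(64, 10)             # task head
domain_head = nn.Linear(64, 2)             # natural (0) vs adversarial (1)

x = torch.randn(16, 20)
y = torch.randint(0, 10, (16,))

# FGSM perturbation as a stand-in adversary
x_req = x.clone().requires_grad_(True)
loss_nat = F.cross_entropy(classifier(features(x_req)), y)
grad_x, = torch.autograd.grad(loss_nat, x_req)
x_adv = (x + 0.1 * grad_x.sign()).detach()

feats = torch.cat([features(x), features(x_adv)])
task_loss = F.cross_entropy(classifier(feats), torch.cat([y, y]))
domain_labels = torch.cat([torch.zeros(16), torch.ones(16)]).long()
domain_loss = F.cross_entropy(domain_head(GradReverse.apply(feats)), domain_labels)
loss = task_loss + domain_loss             # one joint step of DIAL-style training
loss.backward()
print(float(task_loss), float(domain_loss))
```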

URL: https://openreview.net/forum?id=U8uJAUMzj9

---


New submissions
===============


Title: Affordance Extraction with an External Knowledge Database for Text-Based Simulated Environments

Abstract: Text-based simulated environments have proven to be a valid testbed for machine learning approaches. The process of affordance extraction can be used to generate possible actions for interaction within such an environment. In this paper the capabilities and challenges for utilizing external knowledge databases (in particular ConceptNet) in the process of affordance extraction are studied. An algorithm for automated affordance extraction is introduced and evaluated on the Interactive Fiction (IF) platforms TextWorld and Jericho. For this purpose, the collected affordances are translated into text commands for IF agents. To probe the quality of the automated evaluation process, an additional human baseline study is conducted. The paper illustrates that, despite some challenges, external databases can in principle be used for affordance extraction. The paper concludes with recommendations for further modification and improvement of the process.

URL: https://openreview.net/forum?id=shQup8NVKl

---

Title: Concave Utility Reinforcement Learning with Zero-Constraint Violations

Abstract: We consider the problem of tabular infinite horizon concave utility reinforcement learning (CURL) with convex constraints.
For this, we propose a model-based learning algorithm that also achieves zero constraint violations. Assuming that the concave objective and the convex constraints have a solution interior to the set of feasible occupation measures, we solve a tighter optimization problem to ensure that the constraints are never violated despite the imprecise model knowledge and model stochasticity. We use Bellman error-based analysis for tabular infinite-horizon setups which allows analyzing stochastic policies.
Combining the Bellman error-based analysis and the tighter optimization equation, for $T$ interactions with the environment, we obtain a high-probability regret guarantee for the objective which grows as $\Tilde{O}(1/\sqrt{T})$, excluding other factors. The proposed method can be applied to optimistic algorithms to obtain high-probability regret bounds, and also to posterior sampling algorithms to obtain loose Bayesian regret bounds but with a significant improvement in computational complexity.


URL: https://openreview.net/forum?id=WXVkgkPXRk

---

Title: Object-aware Cropping for Self-Supervised Learning

Abstract: A core component of the recent success of self-supervised learning is cropping data augmentation, which selects sub-regions of an image to be used as positive views in the self-supervised loss. The underlying assumption is that randomly cropped and resized regions of a given image share information about the objects of interest, which is captured by the learned representation. This assumption is mostly satisfied in datasets such as ImageNet where there is a large, centered object, which is highly likely to be present in random crops of the full image. However, in other datasets such as OpenImages or COCO, which are more representative of real world uncurated data, there are typically multiple small objects in an image. In this work, we show that self-supervised learning based on the usual random cropping performs poorly on such datasets (measured by the difference from fully-supervised learning). Instead of using pairs of random crops, we propose to leverage an unsupervised object proposal technique; the first view is a crop obtained from this algorithm, and the second view is a dilated version of the first view. This encourages the self-supervised model to learn both object and scene level semantic representations. Using this approach, which we call object-aware cropping, results in significant improvements over random scene cropping on classification and object detection benchmarks. For example, for pre-training on OpenImages, our approach achieves an improvement of 8.8% mAP over random scene cropping (both methods using MoCo-v2). We also show significant improvements on COCO and PASCAL-VOC object detection and segmentation tasks over the state-of-the-art self-supervised learning approaches. Our approach is efficient, simple and general, and can be used in most existing contrastive and non-contrastive self-supervised learning frameworks.
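
A minimal sketch of the two views used by object-aware cropping, assuming an object proposal box is already available (the paper obtains it from an unsupervised proposal method); the contrastive training itself is not shown.

```python
# Sketch of the two views: the first is a crop around an (assumed given)
# object proposal box, the second is a dilated version of the same box.
import numpy as np

def object_aware_views(image, box, dilation=0.5):
    """image: HxWxC array, box: (x0, y0, x1, y1) object proposal."""
    h, w = image.shape[:2]
    x0, y0, x1, y1 = box
    view1 = image[y0:y1, x0:x1]                       # tight object crop

    bw, bh = x1 - x0, y1 - y0
    dx, dy = int(dilation * bw), int(dilation * bh)   # grow the box
    X0, Y0 = max(0, x0 - dx), max(0, y0 - dy)
    X1, Y1 = min(w, x1 + dx), min(h, y1 + dy)
    view2 = image[Y0:Y1, X0:X1]                       # dilated crop: object + context
    return view1, view2

img = np.random.rand(256, 256, 3)
v1, v2 = object_aware_views(img, box=(60, 80, 140, 180))
print(v1.shape, v2.shape)
```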

URL: https://openreview.net/forum?id=WXgJN7A69g

---

Title: Improving Generalization with Approximate Factored Value Functions

Abstract: Reinforcement learning in general unstructured MDPs presents a challenging learning problem. However, certain MDP structures, such as factorization, are known to simplify the learning problem. This fact is often not useful in complex tasks with high-dimensional state spaces which do not usually exhibit such structure, and even if the structure is present, it is typically unknown. In this work, we instead turn this observation on its head. Instead of developing algorithms for structured MDPs, we propose a representation learning algorithm that approximates an unstructured MDP with one that has factorized structure. We then use these factors as a more convenient representation of the state for downstream learning. The particular structure that we leverage is reward factorization, which defines a more compact class of MDPs that admit factorized value functions. We empirically verify the effectiveness of our approach in terms of faster training (better sample complexity) and robust zero-shot transfer (better generalization) on the ProcGen benchmark and the MiniGrid environments.

URL: https://openreview.net/forum?id=LwEWrrKyja

---

Title: Complex-Valued Autoencoders for Object Discovery

Abstract: Object-centric representations form the basis of human perception, and enable us to reason about the world and to systematically generalize to new settings. Currently, most works on unsupervised object discovery focus on slot-based approaches, which explicitly separate the latent representations of individual objects. While the result is easily interpretable, it usually requires the design of involved architectures. In contrast to this, we propose a comparatively simple approach - the Complex AutoEncoder (CAE) - that creates distributed object-centric representations. Following a coding scheme theorized to underlie object representations in biological neurons, its complex-valued activations represent two messages: their magnitudes express the presence of a feature, while the relative phase differences between neurons express which features should be bound together to create joint object representations. In contrast to previous approaches using complex-valued activations for object discovery, we present a fully unsupervised approach that is trained end-to-end - resulting in significant improvements in performance and efficiency. Further, we show that the CAE achieves competitive or better unsupervised object discovery performance on simple multi-object datasets compared to a state-of-the-art slot-based approach while being up to 100 times faster to train.

URL: https://openreview.net/forum?id=1PfcmFTXoa

---

Title: Task-Agnostic Continual Reinforcement Learning: In Praise of a Simple Baseline

Abstract: We study task-agnostic continual reinforcement learning (TACRL) in which standard RL challenges are compounded with \emph{partial observability} stemming from task agnosticism, as well as additional difficulties of continual learning (CL), i.e., learning on a non-stationary sequence of tasks. Here we compare TACRL methods with their soft upper bounds prescribed by previous literature: multi-task learning (MTL) methods which do not have to deal with non-stationary data distributions, as well as task-aware methods, which are allowed to operate under \emph{full observability}. We consider a previously unexplored and straightforward baseline for TACRL, replay-based recurrent RL (3RL), in which we augment an RL algorithm with recurrent mechanisms to address partial observability and experience replay mechanisms to address catastrophic forgetting in CL.

Studying empirical performance in a sequence of RL tasks, we find surprising occurrences of 3RL matching and overcoming the MTL and task-aware soft upper bounds. We lay out hypotheses that could explain this inflection point of continual and task-agnostic learning research. Our hypotheses are empirically tested in continuous control tasks via a large-scale study of the popular multi-task and continual learning benchmark Meta-World. By analyzing different training statistics including gradient conflict, we find evidence that 3RL's outperformance stems from its ability to quickly infer how new tasks relate with the previous ones, enabling forward transfer.

URL: https://openreview.net/forum?id=osvYzEsa9L

---

Title: Limitations of Neural Collapse for Understanding Generalization in Deep Learning

Abstract: The recent work of Papyan, Han, and Donoho (2020) presented an intriguing “Neural Collapse” phenomenon, showing a structural property of interpolating classifiers in the late stage of training. This opened a rich area of exploration studying this phenomenon. Our motivation is to study the upper limits of this research program: How far will understanding Neural Collapse take us in understanding deep learning? First, we investigate its role in generalization. We refine the Neural Collapse conjecture into two separate conjectures: collapse on the train set (an optimization property) and collapse on the test distribution (a generalization property). We find that while Neural Collapse often occurs on the train set, it does not occur on the test set. We thus conclude that Neural Collapse is primarily an optimization phenomenon, with as-yet-unclear connections to generalization. Second, we investigate the role of Neural Collapse in representation learning. We show simple, realistic experiments where more collapse leads to worse last-layer features, as measured by transfer-performance on a downstream task. This suggests that Neural Collapse is not always desirable for representation learning, as previously claimed. We hope our work encourages the community to continue the rich line of Neural Collapse research, while also considering its inherent limitations.

URL: https://openreview.net/forum?id=FFqMVz0oQm

---

Title: Gradient-Free Minimax Optimization: Variance Reduction and Faster Convergence

Abstract: Many important machine learning applications amount to solving minimax optimization problems, and in many cases there is no access to the gradient information, but only the function values. In this paper, we focus on such a gradient-free setting, and consider the nonconvex-strongly-concave minimax stochastic optimization problem. In the literature, various zeroth-order (i.e., gradient-free) minimax methods have been proposed, but none of them achieve the potentially feasible computational complexity of $\mathcal{O}(\epsilon^{-3})$ suggested by the stochastic nonconvex minimization theorem. In this paper, we adopt the variance reduction technique to design a novel zeroth-order variance reduced gradient descent ascent (ZO-VRGDA) algorithm. We show that the ZO-VRGDA algorithm achieves the best known query complexity of $\mathcal{O}(\kappa(d_1 + d_2)\epsilon^{-3})$, which outperforms all previous complexity bounds by orders of magnitude, where $d_1$ and $d_2$ denote the dimensions of the optimization variables and $\kappa$ denotes the condition number. In particular, with a new analysis technique that we develop, our result does not rely on a diminishing or accuracy-dependent stepsize usually required in the existing methods. To the best of our knowledge, this is the first study of zeroth-order minimax optimization with variance reduction. Experimental results on the black-box distributional robust optimization problem demonstrate the advantageous performance of our new algorithm.
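
For intuition, the sketch below shows a standard two-point zeroth-order gradient estimator of the kind such methods build on; the variance reduction and descent-ascent structure of ZO-VRGDA are not reproduced.

```python
# Two-point zeroth-order gradient estimator: gradients are estimated from
# function-value queries only, as in gradient-free minimax methods.
import numpy as np

rng = np.random.default_rng(0)

def zo_gradient(f, x, mu=1e-3, n_samples=50):
    """Gaussian-smoothing estimate of grad f(x) from 2*n_samples queries."""
    g = np.zeros_like(x)
    for _ in range(n_samples):
        u = rng.standard_normal(x.shape)
        g += (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
    return g / n_samples

f = lambda x: np.sum(x ** 2)               # black-box objective (gradient = 2x)
x = np.array([1.0, -2.0, 3.0])
print(zo_gradient(f, x))                   # approximately [2, -4, 6]
```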

URL: https://openreview.net/forum?id=8JJf0qKY1j

---

Title: What Should a (Future) Deep Learning Theory Look Like? A Phenomenological Perspective

Abstract: To advance deep learning methodologies in the next decade, a theoretical framework for reasoning about modern neural networks is needed. While efforts are increasing toward demystifying why deep learning is so effective, a comprehensive picture remains lacking, suggesting that a better theory is possible. We argue that a future deep learning theory should inherit three characteristics: a \textit{hierarchically} structured network architecture, parameters \textit{iteratively} optimized using stochastic gradient-based methods, and information from the data that evolves \textit{compressively}. As an instantiation, we integrate these characteristics into a graphical model called \textit{neurashed}. This model effectively explains some common empirical patterns in deep learning. In particular, neurashed enables insights into implicit regularization, information bottleneck, and local elasticity. Finally, we discuss how neurashed can guide the development of deep learning theories.

URL: https://openreview.net/forum?id=fgCvXrPfgV

---

Title: Multimodal Language Learning in the Face of Missing Modalities

Abstract: We propose extended multimodal alignment (EMMA), a generalized geometric method combined with a cross-entropy loss function that can be used to learn retrieval models incorporating an arbitrary number of views of a particular piece of data, even in the challenging case where a modality becomes unavailable at retrieval time. Our study is motivated by needs in robotics and human-computer interaction, where an agent has many sensors and thus modalities with which a human may interact, both to communicate a desired goal and for the agent to recognize a desired target object. For such problems, there has been little research on integrating more than two modalities. While there have been widely popular works on self-supervised contrastive learning based on cross-entropy, there is an entirely separate family of approaches based on explicit geometric alignment. However, to the best of our knowledge there has been no work on combining the two approaches for multimodal learning. We propose to combine both families of approaches and argue that the two are complementary. We demonstrate the usability of our model on a grounded language object retrieval scenario, where an intelligent agent has to select an object given an unconstrained language command. We leverage four modalities including vision, depth sensing, text, and speech, and we show that our model converges approximately 5 times faster than previous strong baselines, and outperforms or is strongly competitive with state-of-the-art contrastive learning. The code is publicly available on GitHub and will be included for the camera-ready version (it is redacted for anonymity).

URL: https://openreview.net/forum?id=6kwnAU9J8H

---

Title: A connection between sparse and low-rank matrices and its application to manifold learning

Abstract: We consider when a sparse nonnegative matrix $\mathbf{S}$ can be recovered from a real-valued matrix~$\mathbf{L}$ of significantly lower rank. Of particular interest is the setting where the positive elements of $\mathbf{S}$ encode the similarities of nearby points on a low dimensional manifold. The recovery can then be posed as a problem in manifold learning---namely, how to learn a similarity-preserving mapping of high dimensional inputs into a lower dimensional space. We describe an algorithm for this problem based on a generalized low-rank decomposition of sparse matrices. This decomposition has the interesting property that it can be encoded by a neural network with one layer of rectified linear units; since the algorithm discovers this encoding, it can also be viewed as a layerwise primitive for deep learning. The algorithm regards the inputs $\mathbf{x}_i$ and $\mathbf{x}_j$ as similar whenever the cosine of the angle between them exceeds some threshold $\kappa\in(0,1)$. Given this threshold, the algorithm attempts to discover a mapping $\mathbf{x}_i\mapsto\mathbf{y}_i$ by matching the elements of two sparse matrices; in particular, it seeks a mapping for which $\mathbf{S}=\max(0,\mathbf{L})$, where $S_{ij} = \max(0,\mathbf{x}_i\!\cdot\!\mathbf{x}_j\! -\! \kappa\|\mathbf{x}_i\|\|\mathbf{x}_j\|)$ and $L_{ij} = \mathbf{y}_i\!\cdot\!\mathbf{y}_j\! -\! \kappa\|\mathbf{y}_i\|\|\mathbf{y}_j\|$. We apply the algorithm to data sets where vector magnitudes and small cosine distances have interpretable meanings (e.g., the brightness of an image, the similarity to other words). On these data sets, the algorithm is able to discover much lower dimensional representations that preserve these meanings.
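
A small sketch that instantiates the two matrices defined in the abstract for random data; the decomposition algorithm that actually finds the embeddings is not shown.

```python
# Instantiating the abstract's definitions: the sparse similarity matrix S
# built from inputs x_i and the surrogate L built from candidate embeddings
# y_i; the algorithm seeks embeddings with S = max(0, L).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))             # high-dimensional inputs x_i
Y = rng.normal(size=(100, 5))              # candidate low-dimensional outputs y_i
kappa = 0.8                                # cosine-similarity threshold

norms_x = np.linalg.norm(X, axis=1)
norms_y = np.linalg.norm(Y, axis=1)
S = np.maximum(0.0, X @ X.T - kappa * np.outer(norms_x, norms_x))
L = Y @ Y.T - kappa * np.outer(norms_y, norms_y)

mismatch = np.abs(S - np.maximum(0.0, L)).mean()   # objective: drive this to zero
print("sparsity of S:", (S > 0).mean(), "mismatch:", mismatch)
```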

URL: https://openreview.net/forum?id=p8gncJbMit

---

Title: Input Invex Neural Network

Abstract: Connected decision boundaries are used in different areas like image segmentation, clustering, alpha-shapes, or defining a region in nD-space. However, methods for generating such connected decision boundaries using neural networks are lacking in the machine learning literature. While exploring such methods, we found that such decision boundaries can be generated by thresholding a special kind of function called an invex function. We find a connection between invex functions and the connectedness of regions and manifolds, and we apply connectedness and locality as a foundation for interpreting the nD data space. In this paper, we present two methods for constructing invex functions using neural networks. The first is based on intuitions developed visually and constrains the function using our method (Gradient Clipped-Gradient Penalty). The second is based on later findings on the relationship of invex functions to compositions of invertible and convex functions. Using connectedness as a basic interpretation method, we create connected-region-based classifiers. We show that multiple connected-set-based classifiers can approximate any classification function. In the experiments section, we first use the invex function for regression and classification tasks to visualize the global optimality and connected sets on 2D toy datasets. Furthermore, we use our methods for classification tasks using an ensemble of models as well as a single model on larger-scale datasets. The experiments show that connected-set-based classifiers do not have a significant disadvantage compared to ordinary neural network classifiers. We also evaluate various properties of invex functions and connected sets. Overall, this exploration suggests that invex functions are fundamental to understanding and applying the locality and connectedness of input space, which is useful for multiple tasks.

URL: https://openreview.net/forum?id=QrwalLdjZN

---

Title: A Snapshot of the Frontiers of Client Selection in Federated Learning

Abstract: Federated learning (FL) has been proposed as a privacy-preserving approach in distributed machine learning. A federated learning architecture consists of a central server and a number of clients that have access to private, potentially sensitive data. Clients are able to keep their data in their local machines and only share their locally trained model's parameters with a central server that manages the collaborative learning process. FL has delivered promising results in real-life scenarios, such as healthcare, energy and finance. However, when the number of participating clients is large, the overhead of managing the clients slows down the learning. Thus, client selection has been introduced as a strategy to limit the number of communicating parties at every step of the process. Since the early naïve random selection of clients, several client selection methods have been proposed in the literature. Unfortunately, given that this is an emergent field, there is a lack of a taxonomy of client selection methods, making it hard to compare approaches. In this paper, we propose a taxonomy of client selection in Federated Learning that enables us to shed light on current progress in the field and identify potential areas of future research in this promising area of machine learning.

URL: https://openreview.net/forum?id=vwOKBldzFu

---

Title: Benchmarks and Algorithms for Offline Preference-Based Apprenticeship Learning

Abstract: Learning a reward function from human preferences is challenging as it typically requires having a high-fidelity simulator or using expensive and potentially unsafe actual physical rollouts in the environment. However, in many tasks the agent might have access to offline data from related tasks in the same target environment. While offline data is increasingly being used to aid policy optimization via offline RL, our observation is that it can be a surprisingly rich source of information for preference learning as well. We propose an approach that uses an offline dataset to craft preference queries via pool-based active learning, learns a distribution over reward functions, and optimizes a corresponding policy via offline RL. Crucially, our proposed approach does not require actual physical rollouts or an accurate simulator for either the reward learning or policy optimization steps. To test our approach, we first evaluate existing offline RL benchmarks for their suitability for offline reward learning. Surprisingly, for many offline RL domains, we find that simply using a trivial reward function results in good policy performance, making these domains ill-suited for evaluating learned rewards. To address this, we identify a subset of existing offline RL benchmarks that are well suited for offline reward learning and also propose new offline apprenticeship learning benchmarks which allow for more open-ended behaviors. When evaluated on this curated set of domains, our empirical results suggest that combining offline RL with learned human preferences can enable an agent to learn to perform novel tasks that were not explicitly shown in the offline data.

URL: https://openreview.net/forum?id=TGuXXlbKsn

---

Title: Calibrated Selective Classification

Abstract: Selective classification allows models to abstain from making predictions (e.g., say ``I don't know'') when in doubt in order to obtain better effective accuracy. While typical selective models can succeed at producing more accurate predictions on average, they may still allow for wrong predictions that have high confidence, or skip correct predictions that have low confidence. Providing calibrated uncertainty estimates alongside predictions---probabilities that correspond to true frequencies---can be as important as having predictions that are simply accurate on average. Uncertainty estimates, however, can sometimes be unreliable. In this paper, we develop a new approach to selective classification in which we propose a method for rejecting examples with ``uncertain'' uncertainties. By doing so, we aim to make predictions with well-calibrated uncertainty estimates over the distribution of accepted examples, a property we call selective calibration. We present a framework for learning selectively calibrated models, where a separate selector network is trained to improve the selective calibration error of a given base model. In particular, our work focuses on achieving robust calibration, where the model is intentionally designed to be tested on out-of-domain data. We achieve this through a training strategy inspired by distributionally robust optimization, in which we apply simulated input perturbations to the known, in-domain training data. We demonstrate the empirical effectiveness of our approach on multiple image classification and lung cancer risk assessment tasks.

URL: https://openreview.net/forum?id=zFhNBs8GaV

---

Title: Decentralized Feature-Distributed Optimization for Generalized Linear Models

Abstract: We consider the ``all-for-one'' decentralized learning problem for generalized linear models. The features of each sample are partitioned among several collaborating agents in a connected network, but only one agent observes the response variables. To solve the regularized empirical risk minimization in this distributed setting, we apply the Chambolle--Pock primal--dual algorithm to an equivalent saddle-point formulation of the problem. The primal and dual iterations are either in closed-form or reduce to coordinate-wise minimization of scalar convex functions. We establish convergence rates for the empirical risk minimization under two different assumptions on the loss function (Lipschitz and square root Lipschitz), and show how they depend on the characteristics of the design matrix and the Laplacian of the network.

URL: https://openreview.net/forum?id=82JahDJRpO

---

Title: SMILE: Sample-to-feature Mixup for Efficient Transfer Learning

Abstract: To improve the performance of deep learning, mixup has been proposed to force neural networks to favor simple linear behaviors in-between training samples. Performing mixup for transfer learning with pre-trained models, however, is not that simple: a high-capacity pre-trained model with a large fully-connected (FC) layer could easily overfit to the target dataset even with samples-to-labels mixed up. In this work, we propose SMILE — Sample-to-feature Mixup for Efficient Transfer Learning. With mixed images as inputs, SMILE regularizes the outputs of CNN feature extractors to learn from the mixed feature vectors of the inputs, in addition to the mixed labels. SMILE incorporates a mean teacher to provide the surrogate "ground truth" for the mixed feature vectors. The sample-to-feature mixup regularizer is imposed both on deep features for the target domain and classifier outputs for the source domain, bounding the linearity in-between samples for target tasks. Extensive experiments have been done to verify the performance improvement made by SMILE, in comparison with a wide spectrum of transfer learning algorithms, including fine-tuning, L$^2$-SP, DELTA, BSS, RIFLE, Co-Tuning and RegSL, even with mixup strategies combined. Ablation studies show that the vanilla sample-to-label mixup strategies could marginally increase the linearity in-between training samples but lack generalizability, while SMILE significantly improves the mixup effects in both label and feature spaces on both training and testing datasets. The empirical observations back up our design intuition and purposes.
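
A hedged sketch of the sample-to-feature mixup regularizer described above; the networks, the source-domain classifier term, and the mean-teacher update schedule are simplified stand-ins.

```python
# Sketch: a mixed image is fed to the student, and its feature vector is
# pulled toward the mixup of the mean teacher's features for the two original
# images, in addition to the usual mixed-label loss.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU())
head = nn.Linear(128, 10)
teacher = copy.deepcopy(student)           # mean teacher (EMA copy of student)
for p in teacher.parameters():
    p.requires_grad_(False)

x1, x2 = torch.rand(8, 3, 32, 32), torch.rand(8, 3, 32, 32)
y1, y2 = torch.randint(0, 10, (8,)), torch.randint(0, 10, (8,))
lam = 0.7

x_mix = lam * x1 + (1 - lam) * x2
feat_mix = student(x_mix)
with torch.no_grad():
    feat_target = lam * teacher(x1) + (1 - lam) * teacher(x2)   # surrogate target

label_loss = lam * F.cross_entropy(head(feat_mix), y1) \
           + (1 - lam) * F.cross_entropy(head(feat_mix), y2)    # sample-to-label mixup
feature_loss = F.mse_loss(feat_mix, feat_target)                # sample-to-feature mixup
loss = label_loss + feature_loss
loss.backward()
print(float(label_loss), float(feature_loss))
```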


URL: https://openreview.net/forum?id=czgMCpvrDM

---

Title: Unsupervised Domain Adaptation via Minimized Joint Error

Abstract: Unsupervised domain adaptation transfers knowledge from a learned source domain to a different target distribution, for which only a few labeled examples, or none at all, are available. Some researchers have proposed upper bounds on the target error when transferring the knowledge; e.g., Ben-David et al. (2010) established a theory based on minimizing the source error and the distance between marginal distributions simultaneously. However, in most works the joint error is usually ignored due to its intractability. In this paper, we argue that the joint error is essential for the domain adaptation problem, in particular if samples from different classes in the source/target are closely aligned when matching the marginal distributions due to a large domain gap. To tackle this problem, we propose a novel objective that relates to an upper bound of the joint error. Moreover, we adopt a hypothesis space induced by source/pseudo-target labels that can reduce the search space to further tighten this bound. For the dissimilarity measurement between hypotheses, we propose a novel cross margin discrepancy to alleviate the instability during adversarial learning. In addition, we present extensive empirical evidence showing that our proposal boosts image classification accuracy on standard domain adaptation benchmarks.

URL: https://openreview.net/forum?id=eOh0SICZOb

---

Title: Logistic Regression Through the Veil of Imprecise Data

Abstract: Logistic regression is a popular supervised learning algorithm used to assess the probability of a variable having a binary label based on some predictive features. Standard methods can only deal with precisely known data; however, many datasets have uncertainties that traditional methods either reduce to a single point or completely disregard. This paper shows that it is possible to include these uncertainties by considering an imprecise logistic regression model using the set of possible models obtained from values within the intervals. This method can express the epistemic uncertainty neglected by traditional methods.
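
As a rough illustration (not the paper's exact formulation), the sketch below fits one ordinary logistic regression per sampled configuration of interval-valued features and reports the resulting band of predicted probabilities.

```python
# Sketch of a *set* of logistic-regression models: each feature is only known
# to lie in an interval, so we fit one ordinary model per sampled configuration
# of values inside those intervals and report the band of predicted
# probabilities at a query point.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
x_true = rng.normal(size=(n, 1))
y = (x_true[:, 0] + 0.3 * rng.normal(size=n) > 0).astype(int)

# Imprecise data: each measurement is only known up to +/- 0.2
lower, upper = x_true - 0.2, x_true + 0.2

x_query = np.array([[0.1]])
probs = []
for _ in range(200):                                   # sample possible datasets
    x_sample = rng.uniform(lower, upper)               # any values in the intervals
    model = LogisticRegression().fit(x_sample, y)
    probs.append(model.predict_proba(x_query)[0, 1])

print("probability bounds at x=0.1:", min(probs), max(probs))
```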

URL: https://openreview.net/forum?id=757e7RidmH

---

Title: ARMA Cell: A Modular and Effective Approach for Neural Autoregressive Modeling

Abstract: The autoregressive moving average (ARMA) model is a classical, and arguably one of the most studied, approaches to modeling time series data. It has compelling theoretical properties and is widely used among practitioners. More recent deep learning approaches have popularized recurrent neural networks (RNNs) and, in particular, long short-term memory (LSTM) cells, which have become one of the best performing and most common building blocks in neural time series modeling. While advantageous for time series data or sequences with long-term effects, complex RNN cells are not always a must and can sometimes even be inferior to simpler recurrent approaches. In this work, we introduce the ARMA cell, a simpler, modular and effective approach for time series modeling in neural networks. This cell can be used in any neural network architecture where recurrent structures are present and naturally handles multivariate time series using vector autoregression. We also introduce the ConvARMA cell as a natural successor for spatially-correlated time series. Our experiments show that the proposed methodology is competitive with popular alternatives in terms of performance while being more robust and compelling due to its simplicity.
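
For intuition only, here is one plausible reading of an ARMA-style recurrent cell: a nonlinearity applied to a linear combination of the last p inputs and the cell's own last q outputs. The paper's exact parameterization and the ConvARMA variant are not shown.

```python
# Hedged sketch of an ARMA-style recurrent cell: an AR part over lagged inputs
# and an MA-like part over the cell's own lagged outputs, followed by a
# nonlinearity. This is an illustrative reading, not the paper's definition.
import torch
import torch.nn as nn

class ARMACellSketch(nn.Module):
    def __init__(self, dim, p=3, q=1):
        super().__init__()
        self.p, self.q = p, q
        self.W = nn.Linear(p * dim, dim, bias=True)    # weights on lagged inputs
        self.A = nn.Linear(q * dim, dim, bias=False)   # weights on lagged outputs

    def forward(self, x):                              # x: (batch, time, dim)
        b, t, d = x.shape
        xs = [torch.zeros(b, d)] * self.p              # input lag buffer
        ys = [torch.zeros(b, d)] * self.q              # output lag buffer
        outputs = []
        for step in range(t):
            xs = [x[:, step]] + xs[:-1]
            y = torch.tanh(self.W(torch.cat(xs, -1)) + self.A(torch.cat(ys, -1)))
            ys = [y] + ys[:-1]
            outputs.append(y)
        return torch.stack(outputs, dim=1)

print(ARMACellSketch(dim=4)(torch.randn(2, 10, 4)).shape)   # -> (2, 10, 4)
```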

URL: https://openreview.net/forum?id=F0sHMGzSCv

---

Title: Server-Side Stepsizes and Sampling Without Replacement Provably Help in Federated Optimization

Abstract: We present a theoretical study of server-side optimization in federated learning. Our results are the first to show that the widely popular heuristic of scaling the client updates with an extra parameter is very useful in the context of Federated Averaging (FedAvg) with local passes over the client data. Each local pass is performed without replacement using Random Reshuffling, which is a key reason we can show improved complexities. In particular, we prove that whenever the local stepsizes are small, and the update direction is given by FedAvg in conjunction with Random Reshuffling over all clients, one can take a big leap in the obtained direction and improve the rates for convex, strongly convex, and non-convex objectives. For instance, in the non-convex regime the rate of convergence improves from $\mathcal{O}\left(\varepsilon^{-3}\right)$ to $\mathcal{O}\left(\varepsilon^{-2}\right)$. This result is new even for Random Reshuffling performed on a single node. In contrast, if the local stepsizes are large, we prove that the noise of client sampling can be controlled by using a small server-side stepsize. To the best of our knowledge, this is the first time that local steps provably help to overcome the communication bottleneck. Together, our results on the advantages of large and small server-side stepsizes give a formal justification for the practice of adaptive server-side optimization in federated learning. Moreover, we consider a variant of our algorithm that supports partial client participation, which makes the method more practical.
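
A toy sketch of the analyzed setup: each client performs one local pass with Random Reshuffling and a small local stepsize, and the server takes a larger step in the averaged update direction; objectives and constants are illustrative assumptions.

```python
# Toy FedAvg round with Random Reshuffling (sampling without replacement) on
# each client and a larger server-side stepsize. Quadratic losses are used as
# stand-in client objectives; the constants are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
dim, n_clients, n_local = 5, 10, 20
targets = [rng.normal(size=dim) for _ in range(n_clients)]   # client i minimizes ||w - a_ij||^2
client_data = [t + 0.1 * rng.normal(size=(n_local, dim)) for t in targets]

w = np.zeros(dim)
lr_local, lr_server = 0.05, 1.5                              # small local, larger server stepsize

for rnd in range(50):
    updates = []
    for data in client_data:
        w_i = w.copy()
        for j in rng.permutation(n_local):                   # Random Reshuffling: one pass,
            w_i -= lr_local * 2 * (w_i - data[j])            # sampling without replacement
        updates.append(w_i - w)
    w = w + lr_server * np.mean(updates, axis=0)             # server-side step

print("distance to average optimum:", np.linalg.norm(w - np.mean(targets, axis=0)))
```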

URL: https://openreview.net/forum?id=K796fgqQdh

---
