Weekly TMLR digest for Apr 08, 2022

TMLR

Apr 8, 2022, 10:43:43
to tmlr-annou...@googlegroups.com

New submissions
===============


Title: Flipped Classroom: Effective Teaching for Chaotic Time Series Forecasting

Abstract: Gated RNNs like LSTM and GRU are the most common choice for forecasting time series data and reach state-of-the-art performance. Training such sequence-to-sequence RNN models can be delicate, though. While gated RNNs effectively tackle exploding and vanishing gradients, there remains the exposure bias problem provoked by training sequence-to-sequence models with teacher forcing. Exposure bias is a concern in natural language processing (NLP) as well, and there are already plenty of studies that propose solutions, the most prominent probably being scheduled sampling. For time series forecasting, though, the most frequent suggestion is to train the model in free running mode to stabilize its prediction capabilities over longer horizons. In this paper, we demonstrate that exposure bias is a serious problem even or especially outside of NLP and that free running training of such models is only sometimes successful. To fill the gap, we formalize curriculum learning (CL) strategies on the training scale as well as the training iteration scale, propose several completely new curricula, and systematically evaluate their performance in two experimental sets. We use six prominent chaotic dynamical systems for these experiments. We find that the newly proposed increasing training-scale curricula combined with a probabilistic iteration-scale curriculum consistently outperform previous training strategies, yielding an NRMSE improvement of up to 81% over free running or teacher-forced training. For some datasets we additionally observe a reduced number of training iterations, and all models trained with the new curricula yield higher prediction stability, allowing for longer prediction horizons.
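
The paper's curricula are its own contribution; purely as a rough illustration of a probabilistic iteration-scale curriculum (closely related to scheduled sampling), here is a minimal, hypothetical PyTorch sketch in which each decoding step uses the ground truth with a probability that is annealed over training. The model, data, and schedule are placeholders, not the authors' setup.

    import torch
    import torch.nn as nn

    # Toy sequence-to-sequence GRU forecaster (placeholder architecture).
    class Seq2SeqGRU(nn.Module):
        def __init__(self, dim=1, hidden=32):
            super().__init__()
            self.encoder = nn.GRU(dim, hidden, batch_first=True)
            self.cell = nn.GRUCell(dim, hidden)
            self.head = nn.Linear(hidden, dim)

        def forward(self, context, targets, tf_prob):
            # context: (B, T_in, dim), targets: (B, T_out, dim)
            _, h = self.encoder(context)
            h = h.squeeze(0)
            inp = context[:, -1, :]              # last observed value seeds the decoder
            outputs = []
            for t in range(targets.size(1)):
                h = self.cell(inp, h)
                pred = self.head(h)
                outputs.append(pred)
                # Probabilistic iteration-scale curriculum: teacher force with
                # probability tf_prob, otherwise feed back the model's own prediction.
                use_teacher = torch.rand(1).item() < tf_prob
                inp = targets[:, t, :] if use_teacher else pred.detach()
            return torch.stack(outputs, dim=1)

    model = Seq2SeqGRU()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    for epoch in range(50):
        # Training-scale curriculum (a simple linear decay assumed here):
        # start fully teacher forced, end fully free running.
        tf_prob = max(0.0, 1.0 - epoch / 25)
        context = torch.randn(16, 30, 1)         # stand-in for chaotic-system windows
        targets = torch.randn(16, 10, 1)
        loss = loss_fn(model(context, targets, tf_prob), targets)
        opt.zero_grad()
        loss.backward()
        opt.step()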

URL: https://openreview.net/forum?id=vTmgKhvMfB

---

Title: Remember to correct the bias when using deep learning for regression!

Abstract: When training deep learning models for least-squares regression, we cannot expect that the training error residuals of the final model, selected after a fixed training time or based on performance on a hold-out data set, sum to zero. This can introduce a systematic error that accumulates if we are interested in the total aggregated performance over many data points. We suggest adjusting the bias of the machine learning model after training as a default post-processing step, which efficiently solves the problem. The severity of the error accumulation and the effectiveness of the bias correction are demonstrated in exemplary experiments.
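
The proposed post-processing step is simple enough to show directly. A minimal NumPy sketch (the function name and wrapper are illustrative, not from the paper): shift the model's output by the mean training residual so that the residuals sum to zero.

    import numpy as np

    def correct_output_bias(model_predict, X_train, y_train):
        """Return a corrected predictor whose training residuals sum to zero.

        The additive constant is the mean training residual, applied as a
        post-hoc adjustment to the model's output bias.
        """
        residual_mean = np.mean(y_train - model_predict(X_train))
        return lambda X: model_predict(X) + residual_mean

    # Usage sketch: wrap any trained regressor's predict function.
    # corrected_predict = correct_output_bias(net.predict, X_train, y_train)
    # total_estimate = corrected_predict(X_new).sum()   # aggregated totals no longer drift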

URL: https://openreview.net/forum?id=qM1Bsh3Oyt

---

Title: Non-Parametric Domain Adaptation Layer

Abstract: Normalization methods spurred the development of increasingly deep and efficient architectures because they reduce the change in activation distributions during optimization, allowing for efficient training. However, most normalization methods cannot account for test-time distribution changes, increasing the network's vulnerability to noise and input corruptions. As noise is ubiquitous and diverse in many applications, machine learning systems often fail drastically because they cannot cope with mismatches between training- and test-time activation distributions. The most common normalization method, batch normalization, is agnostic to changes in the input distribution at test time. This makes batch normalization prone to performance degradation whenever noise is present at test time. Parametric correction schemes can only adjust for linear transformations of the activation distribution but not for changes in the distribution's shape; this makes the network vulnerable to distribution changes that cannot be reflected in the normalization parameters. We propose an unsupervised, non-parametric distribution correction layer that adapts the activation distribution and reduces the mismatch between the training- and test-time distributions by minimizing the Wasserstein distance at each layer. We empirically show that the proposed method effectively improves classification performance without the need for retraining or fine-tuning the model; on ImageNet-C it achieves up to 11% improvement in Top-1 accuracy.
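
As a hedged, one-dimensional illustration of the idea (not the paper's layer): for scalar activations, the Wasserstein-minimizing map between an empirical test-time distribution and a stored training-time distribution is the monotone quantile-matching map, which could be applied per channel.

    import numpy as np

    def quantile_match(test_acts, train_acts_sorted):
        """Map test-time activations onto the stored training-time distribution.

        For 1-D distributions, matching quantiles is the optimal-transport map
        that minimizes the Wasserstein distance between the two empirical
        distributions.
        """
        ranks = np.argsort(np.argsort(test_acts))       # rank of each test activation
        quantiles = (ranks + 0.5) / len(test_acts)
        idx = np.clip((quantiles * len(train_acts_sorted)).astype(int),
                      0, len(train_acts_sorted) - 1)
        return train_acts_sorted[idx]

    # Per-channel usage sketch: store sorted training activations for each channel
    # at training time, then correct each channel's test-time activations:
    # corrected = quantile_match(test_channel_acts, np.sort(train_channel_acts))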

URL: https://openreview.net/forum?id=SVnsb99sXL

---

Title: AMD: Angular Margin based Knowledge Distillation

Abstract: Knowledge distillation, as a broad class of methods, has led to the development of lightweight and memory-efficient models, using a pre-trained model with a large capacity (teacher network) to train a smaller model (student network). Recently, additional variations of knowledge distillation, utilizing activation maps of intermediate layers as the source of knowledge, have been studied. Generally, in computer vision applications it is seen that the feature activations learned by a higher-capacity model contain richer knowledge, highlighting complete objects while focusing less on the background. Based on this observation, we leverage the teacher's dual ability to accurately distinguish between positive (relevant to the target object) and negative (irrelevant) areas. We propose a new type of distillation, called angular margin-based distillation (AMD). AMD uses the angular distance between positive and negative features by projecting them onto a hypersphere, motivated by the near-angular distributions seen in many feature extractors. Then, we create a more attentive feature from the knowledge encoded by the angular distance, introducing an angular margin to the positive feature. Transferring such knowledge from the teacher network enables the student model to harness the teacher's better discrimination of positive and negative features, thus distilling superior student models. The proposed method is evaluated for various student-teacher network pairs on three public datasets. Furthermore, we show that the proposed method has advantages in compatibility with other learning techniques, such as using fine-grained features, augmentation, and other distillation methods.
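
A rough sketch of the core quantity, with hypothetical tensor shapes (the actual AMD loss is defined in the paper): project positive and negative feature vectors onto the unit hypersphere, measure their angular separation, and enlarge it by a margin on the positive side before distilling.

    import torch
    import torch.nn.functional as F

    def angular_margin_similarity(pos_feat, neg_feat, margin=0.1):
        """Cosine of the angle between positive and negative features, with an
        additive angular margin applied on the positive side (illustrative only)."""
        pos = F.normalize(pos_feat, dim=-1)        # project onto unit hypersphere
        neg = F.normalize(neg_feat, dim=-1)
        cos = (pos * neg).sum(dim=-1).clamp(-1 + 1e-7, 1 - 1e-7)
        theta = torch.acos(cos)                    # angular distance
        return torch.cos(theta + margin)           # margin-sharpened similarity

    # Distillation sketch: encourage the student's margin-adjusted similarities
    # to match the teacher's, e.g. with an L2 loss between the two.
    # loss = F.mse_loss(angular_margin_similarity(s_pos, s_neg),
    #                   angular_margin_similarity(t_pos, t_neg).detach())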

URL: https://openreview.net/forum?id=lsnTlqxtZx

---

Title: NoiLin: Improving adversarial training and correcting stereotype of noisy labels

Abstract: Adversarial training (AT) aims to fit the neighborhood of natural data by generating epoch-wise adversarial variants for learning. In practice, data of different classes may inherently have some overlaps, and AT with larger neighborhoods inevitably encounters larger overlaps, which requires some randomness in the labels to cover them. However, existing AT methods, which mainly focus on manipulating the inner maximization to generate quality adversarial variants or on manipulating the outer minimization to design effective learning objectives, have rarely studied manipulating labels for benefit. In this paper, we study label randomness in AT. First, we thoroughly investigate noisy label (NL) injection into AT's inner maximization and outer minimization, respectively, and obtain some observations on when NL injection benefits AT. Second, based on these observations, we propose a simple but effective method---NoiLIn---that randomly injects NLs into the training data at each training epoch and dynamically increases the NL injection rate once robust overfitting occurs. Empirically, NoiLIn can significantly mitigate AT's undesirable issue of robust overfitting and even further improve the generalization of state-of-the-art AT methods. Philosophically, NoiLIn sheds light on a new perspective of learning with NLs: NLs should not always be deemed detrimental, and even in the absence of NLs in the training set, we may consider injecting them deliberately.
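
The label-injection step itself is straightforward; below is a hedged sketch in which both the flipping scheme and the trigger for raising the rate are simplified placeholders rather than the paper's exact procedure.

    import numpy as np

    def inject_noisy_labels(labels, num_classes, rate, rng):
        """Randomly re-label a fraction `rate` of the training labels for this epoch."""
        labels = labels.copy()
        n_flip = int(rate * len(labels))
        idx = rng.choice(len(labels), size=n_flip, replace=False)
        labels[idx] = rng.integers(0, num_classes, size=n_flip)
        return labels

    rng = np.random.default_rng(0)
    noise_rate = 0.05
    for epoch in range(100):
        # epoch_labels = inject_noisy_labels(clean_labels, 10, noise_rate, rng)
        # ... run one epoch of adversarial training on (inputs, epoch_labels) ...
        # Placeholder trigger: if robust test accuracy starts dropping while robust
        # training accuracy keeps rising (robust overfitting), increase the rate.
        robust_overfitting_detected = False
        if robust_overfitting_detected:
            noise_rate = min(noise_rate * 1.1, 0.5)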


URL: https://openreview.net/forum?id=zlQXV7xtZs

---

Title: Global Reward Maximization with Partial Feedback via Differentially Private Distributed Linear Bandits

Abstract: In this paper, we study the problem of global reward maximization with only partial distributed feedback. This problem is motivated by several real-world applications (e.g., cellular network configuration, dynamic pricing, and policy selection) where an action taken by a central entity influences a large population that contributes to the global reward. However, collecting such reward feedback from the entire population not only incurs a prohibitively high cost but often leads to privacy concerns. To tackle this, we formulate the problem as a differentially private distributed linear bandit (DP-DLB), where only a subset of users from the population (called clients) are selected to participate in the learning process, and the central server learns the global model from such partial feedback by iteratively aggregating these clients' local feedback in a differentially private fashion. We then propose a unified algorithmic learning framework, called differentially private distributed phased elimination (DP-DPE), which enables us to naturally integrate popular differential privacy (DP) models (including central DP, local DP, and shuffle DP) into the learning process. Furthermore, we analyze the performance of the DP-DPE algorithm and show that it achieves both sublinear regret and sublinear communication cost. Interestingly, we highlight that DP-DPE allows us to achieve privacy protection ``for free'', as the additional cost due to privacy can be a lower-order additive term. Finally, we conduct simulations to corroborate our theoretical results and demonstrate the effectiveness of DP-DPE in terms of regret, communication cost, and privacy guarantees.
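
As an illustration of how central DP can enter the aggregation step, here is the standard Gaussian mechanism applied to a mean of client feedback; this is a generic sketch, not the paper's exact DP-DPE noise calibration.

    import numpy as np

    def private_average(client_feedback, epsilon, delta, value_range=1.0, rng=None):
        """Differentially private mean of per-client feedback via the Gaussian mechanism.

        Changing one client's value changes the mean by at most value_range / n,
        which is the sensitivity used to calibrate the noise.
        """
        if rng is None:
            rng = np.random.default_rng()
        n = len(client_feedback)
        sensitivity = value_range / n
        sigma = sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / epsilon
        return float(np.mean(client_feedback) + rng.normal(0.0, sigma))

    # Usage sketch: each phase, the server privately aggregates the sampled clients'
    # local reward estimates before updating its global linear model.
    # noisy_reward = private_average(local_estimates, epsilon=1.0, delta=1e-5)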

URL: https://openreview.net/forum?id=Cc5hHF5w3h

---

Title: Auto-Lambda: Disentangling Dynamic Task Relationships

Abstract: Understanding the structure of multiple related tasks allows multi-task learning to improve the generalisation ability of one or all of them. However, it usually requires training each pairwise combination of tasks together in order to capture task relationships, at an extremely high computational cost. In this work, we learn task relationships via an automated weighting framework, named Auto-Lambda. Unlike previous methods where task relationships are assumed to be fixed, Auto-Lambda is a gradient-based meta-learning framework which explores continuous, dynamic task relationships via task-specific weightings, and can optimise any chosen combination of tasks through the formulation of a meta-loss, where the validation loss automatically influences task weightings throughout training. We apply the proposed framework to both multi-task and auxiliary learning problems in computer vision and robotics, and show that Auto-Lambda achieves state-of-the-art performance, even when compared to optimisation strategies designed specifically for each problem and data domain. Finally, we observe that Auto-Lambda can discover interesting learning behaviours, leading to new insights in multi-task learning. Code is attached in the supplementary material.

URL: https://openreview.net/forum?id=KKeCMim5VN

---

Title: Multi-Agent Off-Policy TDC with Near-Optimal Sample and Communication Complexities

Abstract: The finite-time convergence of off-policy temporal difference (TD) learning has been comprehensively studied recently. However, such convergence has not been established for off-policy TD learning in the multi-agent setting, which covers broader reinforcement learning applications and is fundamentally more challenging. This work develops a decentralized TD with correction (TDC) algorithm for multi-agent off-policy TD learning under Markovian sampling. In particular, our algorithm avoids sharing the actions, policies, and rewards of the agents, and adopts mini-batch sampling to reduce the sampling variance and communication frequency. Under Markovian sampling and linear function approximation, we prove that the finite-time sample complexity of our algorithm for achieving an $\epsilon$-accurate solution is of the order $\mathcal{O}\big(\frac{M\ln\epsilon^{-1}}{\epsilon(1-\sigma_2)^2}\big)$, where $M$ denotes the total number of agents and $\sigma_2$ is a network parameter. This matches the sample complexity of centralized TDC. Moreover, our algorithm achieves the optimal communication complexity $\mathcal{O}\big(\frac{\sqrt{M}\ln\epsilon^{-1}}{1-\sigma_2}\big)$ for synchronizing the value function parameters, which is order-wise lower than the communication complexity of the existing decentralized TD(0). Numerical simulations corroborate our theoretical findings.

URL: https://openreview.net/forum?id=tnPjQpYk7D

---

Title: The Graph Cut Kernel for Ranked Data

Abstract: Many algorithms for ranked data become computationally intractable as the number of objects grows, due to the complex geometric structure induced by rankings. An additional challenge is posed by partial rankings, i.e. rankings in which the preference is only known for a subset of all objects. For these reasons, state-of-the-art methods cannot scale to real-world applications, such as recommender systems. We address this challenge by exploiting the geometric structure of ranked data and additional available information about the objects to derive a kernel for rankings based on the graph cut function. The graph cut kernel combines the efficiency of submodular optimization with the theoretical properties of kernel-based methods. We demonstrate that our novel kernel drastically reduces the computational cost while maintaining the same accuracy as state-of-the-art methods.

URL: https://openreview.net/forum?id=SEUGkraMPi

---

Title: Momentum Capsule Networks

Abstract: Capsule networks are a class of neural networks that aim at solving some limiting factors of Convolutional Neural Networks. However, baseline capsule networks have failed to reach state-of-the-art results on more complex datasets due to their high computation and memory requirements. We tackle this problem by proposing a new network architecture, called Momentum Capsule Network (MoCapsNet). MoCapsNets are inspired by Momentum ResNets, a type of network that applies reversible residual building blocks. Reversible networks allow the activations of the forward pass to be recomputed during backpropagation, so memory requirements can be drastically reduced. In this paper, we provide a framework for how invertible residual building blocks can be applied to capsule networks. We show that MoCapsNet beats the accuracy of baseline capsule networks on MNIST, SVHN and CIFAR-10 while using considerably less memory. The source code is available on https://redacted.
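
The memory saving comes from reversible (RevNet/Momentum ResNet-style) building blocks, whose inputs can be recomputed exactly from their outputs. A minimal additive-coupling sketch, not MoCapsNet's capsule-specific layers:

    import torch
    import torch.nn as nn

    class ReversibleBlock(nn.Module):
        """Additive-coupling reversible block: inputs are recoverable from outputs,
        so activations need not be stored for the backward pass."""
        def __init__(self, dim):
            super().__init__()
            self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
            self.g = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

        def forward(self, x1, x2):
            y1 = x1 + self.f(x2)
            y2 = x2 + self.g(y1)
            return y1, y2

        def inverse(self, y1, y2):
            x2 = y2 - self.g(y1)
            x1 = y1 - self.f(x2)
            return x1, x2

    block = ReversibleBlock(8)
    x1, x2 = torch.randn(4, 8), torch.randn(4, 8)
    y1, y2 = block(x1, x2)
    r1, r2 = block.inverse(y1, y2)
    assert torch.allclose(r1, x1, atol=1e-6) and torch.allclose(r2, x2, atol=1e-6)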

URL: https://openreview.net/forum?id=Su290sknyQ

---

Title: TLDR: Twin Learning for Dimensionality Reduction

Abstract: Dimensionality reduction methods are unsupervised approaches which learn low-dimensional spaces where some properties of the initial space, typically the notion of ``neighborhood'', are preserved. They are a crucial component of diverse tasks like visualization, compression, indexing, and retrieval. Aiming for a totally different goal, self-supervised visual representation learning has been shown to produce transferable representation functions by learning models that encode invariance to artificially created distortions, e.g. a set of hand-crafted image transformations. Unlike manifold learning methods that usually require propagation on large k-NN graphs or complicated optimization solvers, self-supervised learning approaches rely on simpler and more scalable frameworks for learning. In this paper, we unify these two families of approaches from the angle of manifold learning and propose TLDR, a dimensionality reduction method for generic input spaces that ports the recent self-supervised learning framework of Zbontar et al. (2021) to a setting where it is hard or impossible to define an appropriate set of distortions by hand. We propose to use nearest neighbors to build pairs from a training set, and a redundancy reduction loss borrowed from the self-supervised literature to learn an encoder that produces representations invariant across such pairs. TLDR is a method that is simple, easy to implement and train, and of broad applicability; it consists of an offline nearest neighbor computation step that can be highly approximated, and a straightforward learning process that does not require mining negative samples to contrast, eigendecompositions, or cumbersome optimization solvers. Aiming for scalability, the Achilles' heel of manifold learning, we focus on improving linear dimensionality reduction, a technique that is still an integral part of many large-scale systems, and show consistent gains on image and document retrieval tasks, e.g. improving the performance of DINO on ImageNet or retaining it with a 10x compression, or gaining +4% mAP over PCA on ROxford for GeM-AP.
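
The redundancy reduction loss referred to is the Barlow Twins objective of Zbontar et al. (2021); a hedged restatement with hypothetical shapes, computed between a sample's representation and that of one of its nearest neighbors:

    import torch

    def redundancy_reduction_loss(z_a, z_b, lam=5e-3):
        """Barlow Twins-style loss: drive the cross-correlation matrix between the
        two batches of representations toward the identity."""
        n, d = z_a.shape
        z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + 1e-6)
        z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + 1e-6)
        c = (z_a.T @ z_b) / n                       # d x d cross-correlation matrix
        on_diag = (torch.diagonal(c) - 1).pow(2).sum()
        off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()
        return on_diag + lam * off_diag

    # In TLDR's setting, z_a would be the encoder output for a training point and
    # z_b the output for one of its (offline-computed) nearest neighbors.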

URL: https://openreview.net/forum?id=86fhqdBUbx

---

Title: How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers

Abstract: Vision Transformers (ViT) have been shown to attain highly competitive performance for a wide range of vision applications, such as image classification, object detection and semantic image segmentation. In comparison to convolutional neural networks, the Vision Transformer's weaker inductive bias is generally found to cause an increased reliance on model regularization or data augmentation (``AugReg'' for short) when training on smaller training datasets. We conduct a systematic empirical study in order to better understand the interplay between the amount of training data, AugReg, model size and compute budget. As one result of this study we find that the combination of increased compute and AugReg can yield models with the same performance as models trained on an order of magnitude more training data: we train ViT models of various sizes on the public ImageNet-21k dataset which either match or outperform their counterparts trained on the larger, but not publicly available JFT-300M dataset.


URL: https://openreview.net/forum?id=4nPswr1KcP

---

Title: A Unified Survey on Anomaly, Novelty, Open-Set, and Out-of-Distribution Detection: Solutions and Future Challenges

Abstract: Machine learning models often encounter samples that diverge from the training distribution. Failure to recognize an out-of-distribution (OOD) sample, and consequently assigning that sample to an in-class label, significantly compromises the reliability of a model. The problem has gained significant attention due to its importance for safely deploying models in open-world settings. Detecting OOD samples is challenging due to the intractability of modeling all possible unknown distributions. To date, several research domains tackle the problem of detecting unfamiliar samples, including anomaly detection, novelty detection, one-class learning, open set recognition, and out-of-distribution detection. Despite having similar and shared concepts, out-of-distribution, open-set, and anomaly detection have been investigated independently. Accordingly, these research avenues have not cross-pollinated, creating research barriers. While some surveys intend to provide an overview of these approaches, they seem to only focus on a specific domain without examining the relationship between different domains. This survey aims to provide a cross-domain and comprehensive review of numerous eminent works in the respective areas while identifying their commonalities. Researchers can benefit from the overview of research advances in different fields and develop future methodologies synergistically. Furthermore, to the best of our knowledge, while there are surveys on anomaly detection or one-class learning, there is no comprehensive or up-to-date survey on out-of-distribution detection, which our survey covers extensively. Finally, having a unified cross-domain perspective, we discuss and shed light on future lines of research, intending to bring these fields closer together.

URL: https://openreview.net/forum?id=oRNFjEmjvV

---

Title: Clustering units in neural networks: upstream vs downstream information

Abstract: It has been hypothesized that some form of "modular" structure in artificial neural networks should be useful for learning, compositionality, and generalization. However, defining and quantifying modularity remains an open problem. We cast the problem of detecting functional modules as the problem of detecting clusters of similar-functioning units. This raises the question of what makes two units functionally similar. For this, we consider two broad families of methods: those that define similarity based on how units respond to structured variations in inputs ("upstream"), and those based on how variations in hidden unit activations affect outputs ("downstream"). We conduct an empirical study quantifying the modularity of hidden layer representations of simple feedforward, fully connected networks, across a range of hyperparameters. For each model, we quantify pairwise associations between hidden units in each layer using a variety of both upstream and downstream measures, then cluster them by maximizing their "modularity score" using established tools from network science. We find two surprising results: first, dropout dramatically increased modularity, while other forms of weight regularization had more modest effects. Second, although we observe that there is usually good agreement about clusters within both upstream methods and downstream methods, there is little agreement about the cluster assignments across these two families of methods. This has important implications for representation learning, as it suggests that finding modular representations that reflect structure in inputs (e.g. disentanglement) may be a distinct goal from learning modular representations that reflect structure in outputs (e.g. compositionality).
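
As a toy illustration of the clustering pipeline (an "upstream"-flavored association measure on hypothetical data; the paper's measures and modularity maximization details differ), one can build a weighted graph of absolute activation correlations between hidden units and cluster it with a standard modularity-maximizing community detector:

    import numpy as np
    import networkx as nx
    from networkx.algorithms.community import greedy_modularity_communities

    # Hypothetical hidden-layer activations: (num_inputs, num_units).
    acts = np.random.randn(1000, 32)

    # Pairwise association between units (absolute Pearson correlation here).
    assoc = np.abs(np.corrcoef(acts.T))
    np.fill_diagonal(assoc, 0.0)

    # Build a weighted graph and maximize modularity to find clusters of units.
    G = nx.from_numpy_array(assoc)
    clusters = greedy_modularity_communities(G, weight="weight")
    print([sorted(c) for c in clusters])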


URL: https://openreview.net/forum?id=Euf7KofunK

---

Title: A Self-Supervised Framework for Function Learning and Extrapolation

Abstract: Understanding how agents learn to generalize — and, in particular, to extrapolate — in high-dimensional, naturalistic environments remains a challenge for both machine learning and the study of biological agents. One approach to this has been the use of function learning paradigms, which allow agents' empirical patterns of generalization for smooth scalar functions to be described precisely. However, to date, such work has not succeeded in identifying mechanisms that acquire the kinds of general-purpose representations over which function learning can operate to exhibit the patterns of generalization observed in human empirical studies. Here, we present a framework for how a learner may acquire such representations, which then support generalization, and extrapolation in particular, in a few-shot fashion. Taking inspiration from a classic theory of visual processing, we construct a self-supervised encoder that implements the basic inductive bias of invariance under topological distortions. We show that the resulting representations outperform those from other models for unsupervised time series learning in several downstream function learning tasks, including extrapolation.

URL: https://openreview.net/forum?id=ILPFasEaHA

---

Title: Greedy Bayesian Posterior Approximation with Deep Ensembles

Abstract: Ensembles of independently trained neural networks are a state-of-the-art approach to estimate predictive uncertainty in Deep Learning, and can be interpreted as an approximation of the posterior distribution via a mixture of delta functions. The training of ensembles relies on non-convexity of the loss landscape and random initialization of their individual members, making the resulting posterior approximation uncontrolled. This paper proposes a novel and principled method to tackle this limitation, minimizing an $f$-divergence between the true posterior and a kernel density estimator in a function space. We analyze this objective from a combinatorial point of view, and show that it is submodular with respect to mixture components for any $f$. Subsequently, we consider the problem of greedy ensemble construction, and from the marginal gain of the total objective, we derive a novel diversity term for ensemble methods. The performance of our approach is demonstrated on computer vision out-of-distribution detection benchmarks in a range of architectures trained on multiple datasets. The source code of our method is made publicly available.

URL: https://openreview.net/forum?id=P1DuPJzVTN

---

Title: Identifying Causal Structure in Dynamical Systems

Abstract: Mathematical models are fundamental building blocks in the design of dynamical control systems. As control systems become increasingly complex and networked, approaches for obtaining such models based on first principles reach their limits. An alternative is provided by data-driven methods. However, without structural knowledge, these methods are prone to finding spurious correlations in the training data, which can hamper the generalization capabilities of the obtained models. This can significantly lower control and prediction performance when the system is exposed to unknown situations. A preceding causal identification can prevent this pitfall. In this paper, we propose a method that identifies the causal structure of control systems. We design experiments based on the concept of controllability, which provides a systematic way to compute input trajectories that steer the system to specific regions in its state space. We then analyze the resulting data leveraging powerful techniques from causal inference and extend them to control systems. Further, we derive conditions that guarantee the discovery of the true causal structure of the system. Experiments on a robot arm demonstrate reliable causal identification from real-world data and enhanced generalization capabilities.

URL: https://openreview.net/forum?id=X2BodlyLvT

---

Title: Boosting Search Engines with Interactive Agents

Abstract: This paper presents the first successful steps in designing search agents that learn meta-strategies for iterative query refinement in information-seeking tasks. Our approach uses machine reading to guide the selection of refinement terms from aggregated search results. Agents are then empowered with simple but effective search operators to exert fine-grained and transparent control over queries and search results. We develop a novel way of generating synthetic search sessions, which leverages the power of transformer-based language models through (self-)supervised learning. We also present a reinforcement learning agent with dynamically constrained actions that learns interactive search strategies from scratch. Our search agents obtain retrieval and answer quality performance comparable to recent neural methods, using only a traditional term-based BM25 ranking function and interpretable discrete reranking and filtering actions.

URL: https://openreview.net/forum?id=0ZbPmmB61g

---

Title: Unsupervised Learning of Temporal Abstractions with Slot-based Transformers

Abstract: The discovery of reusable sub-routines simplifies decision-making and planning in complex reinforcement learning problems. Previous approaches propose to learn such temporal abstractions in a purely unsupervised fashion through observing state-action trajectories gathered from executing a policy. However, a current limitation is that they process each trajectory in an entirely sequential manner, which prevents them from revising earlier decisions about sub-routine boundary points in light of new incoming information. In this work we propose SloTTAr, a fully parallel approach that integrates sequence processing Transformers with a Slot Attention module and adaptive computation for learning about the number of such sub-routines in an unsupervised fashion. We demonstrate how SloTTAr is capable of outperforming strong baselines in terms of boundary point discovery, even for sequences containing variable amounts of sub-routines, while being up to $7\mathrm{x}$ faster to train on existing benchmarks.

URL: https://openreview.net/forum?id=VHIur3v08z

---

Title: Handwritten stroke augmentation on images

Abstract: In this paper, we introduce handwritten stroke augmentation, a new data augmentation method for handwritten character images. This method focuses on augmenting handwritten image data by altering the shape of input character strokes during training. The proposed handwritten augmentation is similar to position and color augmentation for images, but with a deeper focus on handwritten character strokes. Handwritten stroke augmentation is data-driven, easy to implement, and can be integrated with CNN-based optical character recognition models. It can be applied alongside commonly used data augmentation techniques such as cropping and rotation, and yields better performance of models on handwritten image datasets developed using optical character recognition methods. Our source code will be available on GitHub.
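
The abstract does not spell out the exact transforms, so the following is only a plausible sketch of a stroke-level perturbation: randomly thickening or thinning the strokes of a binarized character image with morphological operations.

    import numpy as np
    from scipy.ndimage import binary_dilation, binary_erosion

    def perturb_strokes(binary_img, rng):
        """Randomly thicken or thin the strokes of a binarized character image.
        This is an illustrative stand-in for stroke-shape augmentation, not the
        paper's exact method."""
        if rng.random() < 0.5:
            return binary_dilation(binary_img, iterations=1)
        return binary_erosion(binary_img, iterations=1)

    rng = np.random.default_rng(0)
    img = np.zeros((28, 28), dtype=bool)
    img[10:18, 5:23] = True                 # stand-in for a character stroke
    augmented = perturb_strokes(img, rng)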

URL: https://openreview.net/forum?id=8m0jGwezEg

---

Title: Understanding Linearity of Cross-Lingual Word Embedding Mappings

Abstract: The technique of Cross-Lingual Word Embedding (CLWE) plays a fundamental role in tackling Natural Language Processing challenges for low-resource languages. Its dominant approaches assume that the relationship between embeddings can be represented by a linear mapping, but there has been no exploration of the conditions under which this assumption holds. This research gap has become critical recently, as it has been evidenced that relaxing mappings to be non-linear can lead to better performance in some cases. We, for the first time, present a theoretical analysis that identifies the preservation of analogies encoded in monolingual word embeddings as a *necessary and sufficient* condition for the ground-truth CLWE mapping between those embeddings to be linear. On a novel cross-lingual analogy dataset that covers five representative analogy categories for twelve distinct languages, we carry out experiments which provide direct empirical support for our theoretical claim. These results offer additional insight into the observations of other researchers and contribute inspiration for the development of more effective cross-lingual representation learning strategies.
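
For context, the linear CLWE mappings the paper analyzes are typically fit by orthogonal Procrustes over a seed dictionary; a minimal NumPy sketch (the dictionary matrices are hypothetical):

    import numpy as np

    def fit_linear_clwe_map(X_src, Y_tgt):
        """Orthogonal Procrustes: find W minimizing ||X_src @ W - Y_tgt||_F subject
        to W being orthogonal, given row-aligned seed translation pairs."""
        u, _, vt = np.linalg.svd(X_src.T @ Y_tgt)
        return u @ vt

    # Usage sketch: X_src / Y_tgt hold embeddings of seed word pairs, one pair per row.
    # W = fit_linear_clwe_map(X_src, Y_tgt)
    # mapped = source_embeddings @ W        # now comparable with target embeddings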


URL: https://openreview.net/forum?id=8HuyXvbvqX

---

Title: Deep Learning for Bayesian Optimization of Scientific Problems with High-Dimensional Structure

Abstract: Bayesian optimization (BO) is a popular paradigm for global optimization of expensive black-box functions, but there are many domains where the function is not completely black-box. The data may have some known structure (e.g. symmetries) and/or the data generation process can yield useful intermediate or auxiliary information in addition to the value of the optimization objective. However, surrogate models traditionally employed in BO, such as Gaussian Processes (GPs), scale poorly with dataset size and do not easily accommodate known structure or auxiliary information. Instead, we propose performing BO on complex, structured problems by using deep learning models with uncertainty, a class of scalable surrogate models that have the representation power and flexibility to handle structured data and exploit auxiliary information. We demonstrate BO on a number of realistic problems in physics and chemistry, including topology optimization of photonic crystal materials using convolutional neural networks, and chemical property optimization of molecules using graph neural networks. On these complex tasks, we show that neural networks often outperform GPs as surrogate models for BO in terms of both sampling efficiency and computational cost.
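
A toy illustration of the general recipe (a deep-ensemble surrogate plus a UCB-style acquisition on a 1-D problem; the paper's models and tasks are far richer):

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    objective = lambda x: -np.sin(3 * x) - x**2 + 0.7 * x   # toy black-box function

    X = rng.uniform(-2, 2, size=(5, 1))                     # initial design
    y = objective(X).ravel()
    candidates = np.linspace(-2, 2, 400).reshape(-1, 1)

    for step in range(15):
        # Ensemble of small MLPs as an uncertainty-aware surrogate.
        preds = []
        for seed in range(5):
            net = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000,
                               random_state=seed)
            net.fit(X, y)
            preds.append(net.predict(candidates))
        preds = np.stack(preds)
        mean, std = preds.mean(0), preds.std(0)
        # UCB acquisition: exploit the mean, explore where the ensemble disagrees.
        x_next = candidates[np.argmax(mean + 2.0 * std)].reshape(1, 1)
        X = np.vstack([X, x_next])
        y = np.append(y, objective(x_next).ravel())

    print("best x:", X[np.argmax(y)], "best value:", y.max())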

URL: https://openreview.net/forum?id=sFGmQZ7GQf

---
