Weekly TMLR digest for Mar 19, 2023

TMLR

Mar 18, 2023, 8:00:10 PM
to tmlr-annou...@googlegroups.com


New certifications
==================

Featured Certification: Patches Are All You Need?

Asher Trockman, J Zico Kolter

https://openreview.net/forum?id=rAnB7JSMXL

---


Reproducibility Certification: PRUDEX-Compass: Towards Systematic Evaluation of Reinforcement Learning in Financial Markets

Shuo Sun, Molei Qin, Xinrun Wang, Bo An

https://openreview.net/forum?id=JjbsIYOuNi

---


Accepted papers
===============


Title: Towards Better Out-of-Distribution Generalization of Neural Algorithmic Reasoning Tasks

Authors: Sadegh Mahdavi, Kevin Swersky, Thomas Kipf, Milad Hashemi, Christos Thrampoulidis, Renjie Liao

Abstract: In this paper, we study the OOD generalization of neural algorithmic reasoning tasks, where the goal is to learn an algorithm (e.g., sorting, breadth-first search, and depth-first search) from input-output pairs using deep neural networks. First, we argue that OOD generalization in this setting is significantly different from common OOD settings. For example, some phenomena observed in the OOD generalization of image classification, such as \emph{accuracy on the line}, are not observed here, and techniques such as data augmentation do not help because the assumptions underlying many augmentation techniques are often violated. Second, we analyze the main challenges (e.g., input distribution shift, non-representative data generation, and uninformative validation metrics) of the current leading benchmark, i.e., CLRS \citep{deepmind2021clrs}, which contains 30 algorithmic reasoning tasks. We propose several solutions, including a simple-yet-effective fix to the input distribution shift and improved data generation. Finally, we propose an attention-based 2WL-graph neural network (GNN) processor which complements message-passing GNNs, so that their combination outperforms the state-of-the-art model by a $3\%$ margin averaged over all algorithms.

URL: https://openreview.net/forum?id=xkrtvHlp3P

---

Title: L-SVRG and L-Katyusha with Adaptive Sampling

Authors: Boxin Zhao, Boxiang Lyu, mladen kolar

Abstract: Stochastic gradient-based optimization methods, such as L-SVRG and its accelerated variant L-Katyusha (Kovalev et al., 2020), are widely used to train machine learning models. The theoretical and empirical performance of L-SVRG and L-Katyusha can be improved by sampling the observations from a non-uniform distribution (Qian et al., 2021). However, to design a desired sampling distribution, Qian et al. (2021) rely on prior knowledge of smoothness constants that can be computationally intractable to obtain in practice when the dimension of the model parameter is high. We propose an adaptive sampling strategy for L-SVRG and L-Katyusha that learns the sampling distribution with little computational overhead, allows it to change with the iterates, and does not require any prior knowledge of the problem parameters. We prove convergence guarantees for L-SVRG and L-Katyusha for convex objectives when the sampling distribution changes with the iterates. These results show that even without prior information, the proposed adaptive sampling strategy matches, and in some cases even surpasses, the performance of the sampling scheme in Qian et al. (2021). Extensive simulations support our theory and the practical utility of the proposed sampling scheme on real data.
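
As a rough illustration of the setup (not the authors' algorithm), the sketch below runs loopless SVRG on a least-squares problem with a non-uniform sampling distribution q that is periodically re-estimated from per-sample gradient norms; the adaptation rule, step size and mixing with the uniform distribution are placeholder choices.

    import numpy as np

    def l_svrg_nonuniform(A, b, eta=0.02, p=0.1, steps=2000, seed=0):
        """Loopless SVRG for least squares with a non-uniform sampling
        distribution q (toy heuristic adaptation, for illustration only)."""
        rng = np.random.default_rng(seed)
        n, d = A.shape
        w = np.zeros(d)
        w_ref = w.copy()
        grad_i = lambda w, i: A[i] * (A[i] @ w - b[i])        # per-sample gradient
        full_grad = lambda w: A.T @ (A @ w - b) / n
        mu = full_grad(w_ref)
        q = np.full(n, 1.0 / n)                               # start uniform
        for _ in range(steps):
            i = rng.choice(n, p=q)
            # Importance-weighted variance-reduced gradient estimator.
            g = (grad_i(w, i) - grad_i(w_ref, i)) / (n * q[i]) + mu
            w -= eta * g
            if rng.random() < p:                              # loopless reference update
                w_ref = w.copy()
                mu = full_grad(w_ref)
                norms = np.linalg.norm(A * (A @ w_ref - b)[:, None], axis=1)
                if norms.sum() > 1e-12:
                    q = 0.5 / n + 0.5 * norms / norms.sum()   # mix with uniform
        return w

    A = np.random.default_rng(1).normal(size=(200, 10))
    b = A @ np.ones(10)
    w_hat = l_svrg_nonuniform(A, b)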

URL: https://openreview.net/forum?id=9lyqt3rbDc

---

Title: Quantum Policy Iteration via Amplitude Estimation and Grover Search – Towards Quantum Advantage for Reinforcement Learning

Authors: Simon Wiedemann, Daniel Hein, Steffen Udluft, Christian B. Mendl

Abstract: We present a full implementation and simulation of a novel quantum reinforcement learning method. Our work is a detailed and formal proof of concept for how quantum algorithms can be used to solve reinforcement learning problems and shows that, given access to error-free, efficient quantum realizations of the agent and environment, quantum methods can yield provable improvements over classical Monte-Carlo based methods in terms of sample complexity. Our approach shows in detail how to combine amplitude estimation and Grover search into a policy evaluation and improvement scheme. We first develop quantum policy evaluation (QPE) which is quadratically more efficient compared to an analogous classical Monte Carlo estimation and is based on a quantum mechanical realization of a finite Markov decision process (MDP). Building on QPE, we derive a quantum policy iteration that repeatedly improves an initial policy using Grover search until the optimum is reached. Finally, we present an implementation of our algorithm for a two-armed bandit MDP which we then simulate.

URL: https://openreview.net/forum?id=HG11PAmwQ6

---

Title: Improved Overparametrization Bounds for Global Convergence of SGD for Shallow Neural Networks

Authors: Bartłomiej Polaczyk, Jacek Cyranka

Abstract: We study the overparametrization bounds required for the global convergence of stochastic gradient descent algorithm for a class of one hidden layer feed-forward neural networks equipped with ReLU activation function. We improve the existing state-of-the-art results in terms of the required hidden layer width. We introduce a new proof technique combining nonlinear analysis with properties of random initializations of the network.

URL: https://openreview.net/forum?id=RjZq6W6FoE

---

Title: Patches Are All You Need?

Authors: Asher Trockman, J Zico Kolter

Abstract: Although convolutional neural networks have been the dominant architecture for computer vision for many years, Vision Transformers (ViTs) have recently shown promise as an alternative. Subsequently, many new models have been proposed which replace the self-attention layer within the ViT architecture with novel operations (such as MLPs), all of which have also been relatively performant. We note that these architectures all share a common component--the patch embedding layer--which enables the use of a simple isotropic template with alternating steps of channel- and spatial-dimension mixing. This raises a question: is the success of ViT-style models due to novel, highly-expressive operations like self-attention, or is it at least in part due to using patches? In this paper, we present some evidence for the latter: specifically, we propose the ConvMixer, an extremely simple and parameter-efficient fully-convolutional model in which we replace the self-attention and MLP layers within the ViT with less-expressive depthwise and pointwise convolutional layers, respectively. Despite its unusual simplicity, ConvMixer outperforms the ViT, MLP-Mixer, and their variants for similar data set sizes and parameter counts, in addition to outperforming classical vision models like ResNet. We argue that this contributes to the evidence that patches are sufficient for designing simple and effective vision models. Our code is available at https://github.com/locuslab/convmixer.
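
The released implementation is linked above; for readers skimming the digest, a minimal PyTorch sketch of the ConvMixer idea (patch embedding followed by alternating depthwise and pointwise convolutions) looks roughly like this, with placeholder hyperparameters:

    import torch.nn as nn

    class Residual(nn.Module):
        def __init__(self, fn):
            super().__init__()
            self.fn = fn
        def forward(self, x):
            return self.fn(x) + x

    def conv_mixer(dim=256, depth=8, kernel_size=9, patch_size=7, n_classes=1000):
        # Patch embedding, then `depth` blocks that mix the spatial dimension
        # (depthwise conv) and the channel dimension (pointwise conv).
        return nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size),
            nn.GELU(), nn.BatchNorm2d(dim),
            *[nn.Sequential(
                Residual(nn.Sequential(
                    nn.Conv2d(dim, dim, kernel_size, groups=dim, padding="same"),
                    nn.GELU(), nn.BatchNorm2d(dim))),
                nn.Conv2d(dim, dim, kernel_size=1),
                nn.GELU(), nn.BatchNorm2d(dim))
              for _ in range(depth)],
            nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten(), nn.Linear(dim, n_classes))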

URL: https://openreview.net/forum?id=rAnB7JSMXL

---

Title: Enhancing Diffusion-Based Image Synthesis with Robust Classifier Guidance

Authors: Bahjat Kawar, Roy Ganz, Michael Elad

Abstract: Denoising diffusion probabilistic models (DDPMs) are a recent family of generative models that achieve state-of-the-art results. In order to obtain class-conditional generation, it was suggested to guide the diffusion process by gradients from a time-dependent classifier. While the idea is theoretically sound, deep learning-based classifiers are infamously susceptible to gradient-based adversarial attacks. Therefore, while traditional classifiers may achieve good accuracy scores, their gradients are possibly unreliable and might hinder the improvement of the generation results. Recent work discovered that adversarially robust classifiers exhibit gradients that are aligned with human perception, and these could better guide a generative process towards semantically meaningful images. We utilize this observation by defining and training a time-dependent adversarially robust classifier and use it as guidance for a generative diffusion model. In experiments on the highly challenging and diverse ImageNet dataset, our scheme introduces significantly more intelligible intermediate gradients, better alignment with theoretical findings, as well as improved generation results under several evaluation metrics. Furthermore, we conduct an opinion survey whose findings indicate that human raters prefer our method's results.
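
For context, classifier guidance shifts the diffusion model's predicted mean for x_{t-1} by the gradient of a time-dependent classifier's log-probability; a hedged sketch of one guided step is below, where `diffusion_model.p_mean_variance` and `robust_classifier` are assumed interfaces rather than a specific library's API.

    import torch

    def guided_mean(x_t, t, y, diffusion_model, robust_classifier, scale=1.0):
        # Gradient of log p(y | x_t) from the (adversarially robust) classifier.
        with torch.enable_grad():
            x_in = x_t.detach().requires_grad_(True)
            logits = robust_classifier(x_in, t)            # time-dependent classifier
            log_prob = torch.log_softmax(logits, dim=-1)
            selected = log_prob[torch.arange(len(y)), y].sum()
            grad = torch.autograd.grad(selected, x_in)[0]
        # Shift the reverse-process mean by the scaled classifier gradient.
        mean, variance = diffusion_model.p_mean_variance(x_t, t)   # assumed method
        return mean + scale * variance * grad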

URL: https://openreview.net/forum?id=tEVpz2xJWX

---

Title: PRUDEX-Compass: Towards Systematic Evaluation of Reinforcement Learning in Financial Markets

Authors: Shuo Sun, Molei Qin, Xinrun Wang, Bo An

Abstract: The financial markets, which involve more than $90 trillion in market capitalization, attract the attention of innumerable investors around the world. Recently, reinforcement learning in financial markets (FinRL) has emerged as a promising direction to train agents for making profitable investment decisions. However, the evaluation of most FinRL methods focuses only on profit-related measures and ignores many critical axes, which is far from satisfactory for financial practitioners who wish to deploy these methods in real-world financial markets. Therefore, we introduce PRUDEX-Compass, which has 6 axes, i.e., Profitability, Risk-control, Universality, Diversity, rEliability, and eXplainability, with a total of 17 measures for a systematic evaluation. Specifically, i) we propose AlphaMix+ as a strong FinRL baseline, which leverages mixture-of-experts (MoE) and risk-sensitive approaches to make diversified risk-aware investment decisions, ii) we evaluate 8 FinRL methods on 4 long-term real-world datasets of influential financial markets to demonstrate the usage of our PRUDEX-Compass, iii) PRUDEX-Compass, together with the 4 real-world datasets, standard implementations of the 8 FinRL methods, and a portfolio management environment, is released as a public resource to facilitate the design and comparison of new FinRL methods. We hope that PRUDEX-Compass can not only shed light on future FinRL research, preventing untrustworthy results from stalling FinRL's path to successful industry deployment, but also provide a new and challenging algorithm evaluation scenario for the reinforcement learning (RL) community.

URL: https://openreview.net/forum?id=JjbsIYOuNi

---

Title: A Unified View of Masked Image Modeling

Authors: Zhiliang Peng, Li Dong, Hangbo Bao, Furu Wei, Qixiang Ye

Abstract: Masked image modeling has demonstrated great potential to eliminate the label-hungry problem of training large-scale vision Transformers, achieving impressive performance on various downstream tasks. In this work, we propose a unified view of masked image modeling after revisiting existing methods. Under this unified view, we introduce a simple yet effective method, termed MaskDistill, which reconstructs normalized semantic features from teacher models at the masked positions, conditioning on corrupted input images. Experimental results on image classification and semantic segmentation show that MaskDistill achieves performance comparable or superior to state-of-the-art methods. When using a huge vision Transformer and pretraining for 300 epochs, MaskDistill obtains 88.3% fine-tuning top-1 accuracy on ImageNet-1k (224 size) and 58.8 mIoU for semantic segmentation on ADE20k (512 size). Code is enclosed in the supplementary materials.
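
A rough sketch of the training signal described above (reconstructing normalized teacher features at masked positions) might look as follows; the choice of smooth-L1 loss and the `student`/`teacher` token-feature interfaces are assumptions, not the paper's exact recipe.

    import torch
    import torch.nn.functional as F

    def maskdistill_loss(images, student, teacher, mask):
        """mask: (B, N) boolean tensor, True at masked token positions.
        Both models are assumed to return per-token features of shape (B, N, D)."""
        with torch.no_grad():
            target = teacher(images)                           # clean input to teacher
            target = F.layer_norm(target, target.shape[-1:])   # normalized features
        pred = student(images, mask)                           # student sees corrupted input
        return F.smooth_l1_loss(pred[mask], target[mask])      # loss only at masked tokens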

URL: https://openreview.net/forum?id=wmGlMhaBe0

---

Title: How Robust is Your Fairness? Evaluating and Sustaining Fairness under Unseen Distribution Shifts

Authors: Haotao Wang, Junyuan Hong, Jiayu Zhou, Zhangyang Wang

Abstract: Increasing concerns have been raised on deep learning fairness in recent years. Existing fairness-aware machine learning methods mainly focus on the fairness of in-distribution data. However, in real-world applications, it is common to have distribution shift between the training and test data. In this paper, we first show that the fairness achieved by existing methods can be easily broken by slight distribution shifts. To solve this problem, we propose a novel fairness learning method termed CUrvature MAtching (CUMA), which can achieve robust fairness generalizable to unseen domains with unknown distributional shifts. Specifically, CUMA enforces the model to have similar generalization ability on the majority and minority groups, by matching the loss curvature distributions of the two groups. We evaluate our method on three popular fairness datasets. Compared with existing methods, CUMA achieves superior fairness under unseen distribution shifts, without sacrificing either the overall accuracy or the in-distribution fairness.
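
As a loose illustration of the curvature-matching idea (a monitoring-style finite-difference proxy, not the authors' estimator or training objective), one could compare group-wise curvature along the gradient direction like this:

    import torch

    def curvature_proxy(model, loss_fn, x, y, eps=1e-2):
        """Finite-difference proxy for ||H v|| with v the normalized gradient."""
        params = [p for p in model.parameters() if p.requires_grad]
        grads = torch.autograd.grad(loss_fn(model(x), y), params)
        gnorm = torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12
        with torch.no_grad():
            for p, g in zip(params, grads):
                p.add_(eps * g / gnorm)                # perturb along the gradient
        grads2 = torch.autograd.grad(loss_fn(model(x), y), params)
        with torch.no_grad():
            for p, g in zip(params, grads):
                p.sub_(eps * g / gnorm)                # restore parameters
        diff = torch.sqrt(sum(((g2 - g) ** 2).sum() for g, g2 in zip(grads, grads2)))
        return (diff / eps).item()

    def curvature_gap(model, loss_fn, batch_majority, batch_minority):
        # Gap between group-wise curvature statistics (smaller is "fairer" here).
        return abs(curvature_proxy(model, loss_fn, *batch_majority)
                   - curvature_proxy(model, loss_fn, *batch_minority))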

URL: https://openreview.net/forum?id=11pGlecTz2

---

Title: Leveraging Demonstrations with Latent Space Priors

Authors: Jonas Gehring, Deepak Gopinath, Jungdam Won, Andreas Krause, Gabriel Synnaeve, Nicolas Usunier

Abstract: Demonstrations provide insight into relevant state or action space regions, bearing great potential to boost the efficiency and practicality of reinforcement learning agents. In this work, we propose to leverage demonstration datasets by combining skill learning and sequence modeling. Starting with a learned joint latent space, we separately train a generative model of demonstration sequences and an accompanying low-level policy. The sequence model forms a latent space prior over plausible demonstration behaviors to accelerate learning of high-level policies. We show how to acquire such priors from state-only motion capture demonstrations and explore several methods for integrating them into policy learning on transfer tasks. Our experimental results confirm that latent space priors provide significant gains in learning speed and final performance. We benchmark our approach on a set of challenging sparse-reward environments with a complex, simulated humanoid, and on offline RL benchmarks for navigation and object manipulation.

URL: https://openreview.net/forum?id=OzGIu4T4Cz

---

Title: Solving Nonconvex-Nonconcave Min-Max Problems exhibiting Weak Minty Solutions

Authors: Axel Böhm

Abstract: We investigate a structured class of nonconvex-nonconcave min-max problems exhibiting so-called \emph{weak Minty} solutions, a notion which was only recently introduced, but is able to simultaneously capture different generalizations of monotonicity. We prove novel convergence results for a generalized version of the optimistic gradient method (OGDA) in this setting, matching the $1/k$ rate for the best iterate in terms of the squared operator norm recently shown for the extragradient method (EG). In addition we propose an adaptive step size version of EG, which does not require knowledge of the problem parameters.
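
For readers unfamiliar with the optimistic gradient method, a toy numpy sketch of the standard (non-generalized) OGDA update on a bilinear game is given below; step sizes for weak Minty problems require the more careful choices analyzed in the paper.

    import numpy as np

    def ogda(F, z0, step=0.1, iters=500):
        # Optimistic gradient: z_{k+1} = z_k - step * (2 F(z_k) - F(z_{k-1})).
        z_prev, z = z0.copy(), z0.copy()
        for _ in range(iters):
            z_next = z - step * (2 * F(z) - F(z_prev))
            z_prev, z = z, z_next
        return z

    # Bilinear game f(x, y) = x * y, so F(z) = (df/dx, -df/dy) = (y, -x).
    F = lambda z: np.array([z[1], -z[0]])
    print(ogda(F, np.array([1.0, 1.0])))   # approaches the saddle point (0, 0)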


URL: https://openreview.net/forum?id=Gp0pHyUyrb

---

Title: Extreme Masking for Learning Instance and Distributed Visual Representations

Authors: Zhirong Wu, Zihang Lai, Xiao Sun, Stephen Lin

Abstract: The paper presents a scalable approach for learning spatially distributed visual representations over individual tokens and a holistic instance representation simultaneously. We use self-attention blocks to represent spatially distributed tokens, followed by cross-attention blocks to aggregate the holistic instance. The core of the approach is the use of extremely large token masking (75\%-90\%) as the data augmentation for supervision. Our model, named ExtreMA, follows the plain BYOL approach where the instance representation from the unmasked subset is trained to predict that from the intact input. Instead of encouraging invariance across inputs, learning requires the model to capture informative variations in an image.

The paper makes three contributions: 1) It presents random masking as a strong and computationally efficient data augmentation for siamese representation learning. 2) With multiple sampling per instance, extreme masking greatly speeds up learning and improves performance with more data. 3) ExtreMA obtains stronger linear probing performance than masked modeling methods, and better transfer performance than prior contrastive models.
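
A hedged sketch of the BYOL-style masked objective described above is shown below; `online`, `target` and `predictor` are assumed token-based encoders/heads (the target being the usual EMA copy of the online network), and the masking interface is a placeholder.

    import torch
    import torch.nn.functional as F

    def extrema_loss(images, online, target, predictor, num_patches, mask_ratio=0.8):
        B = images.shape[0]
        keep = int(num_patches * (1 - mask_ratio))
        visible = torch.rand(B, num_patches).argsort(dim=1)[:, :keep]  # random subset
        z_online = predictor(online(images, visible_idx=visible))      # masked view
        with torch.no_grad():
            z_target = target(images)                                  # intact input
        # Negative cosine similarity between instance representations.
        return -F.cosine_similarity(z_online, z_target, dim=-1).mean()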

URL: https://openreview.net/forum?id=3epEbhdgbv

---


New submissions
===============


Title: Bridging Imitation and Online Reinforcement Learning: An Optimistic Tale

Abstract: In this paper, we address the following problem: given an offline demonstration dataset from an imperfect expert, what is the best way to leverage it to bootstrap online learning performance in MDPs? We first propose an Informed Posterior Sampling-based RL (iPSRL) algorithm that uses the offline dataset, and information about the expert's behavioral policy used to generate the offline dataset. Its cumulative Bayesian regret decays to zero exponentially fast in $N$, the offline dataset size, if the expert is competent enough. Since this algorithm is computationally impractical, we then propose the iRLSVI algorithm, which can be seen as a combination of the RLSVI algorithm for online RL and imitation learning. Our empirical results show that the proposed iRLSVI algorithm achieves a significant reduction in regret compared to two baselines: no offline data, and the offline dataset used without information about the generative policy. Our algorithm bridges online RL and imitation learning for the first time.

URL: https://openreview.net/forum?id=lanGfX0M6C

---

Title: Exploiting Latent Properties to Optimize Neural Codecs

Abstract: End-to-end image/video codecs are becoming competitive with traditional compression techniques that have been developed through decades of manual engineering effort. These trainable codecs have many advantages over traditional techniques, such as easy adaptation to perceptual distortion metrics and high performance on specific domains thanks to their learning ability. However, state-of-the-art neural codecs do not take advantage of vector quantization or of the entropy gradient that is available at the decoding device. In this research, we present theoretical insights about these two properties (quantization and the entropy gradient) and show that they can improve the performance of many off-the-shelf codecs. First, we prove that a non-uniform quantization map on a neural codec's latents is not necessary. Thus, we improve performance by using a predefined optimal uniform vector quantization map. Second, we show theoretically that the gradient of the entropy (available at the decoder side) is correlated with the gradient of the reconstruction error (which is not available at the decoder side). Thus, we use the former as a proxy to improve compression performance. According to our results, this proposal saves 2-4% of rate at the same quality for various pre-trained methods.

URL: https://openreview.net/forum?id=Sv0FWYkQgh

---

Title: Distributed SGD in overparameterized Linear Regression

Abstract: We consider distributed learning using constant-stepsize SGD over several devices, each sending a final model update to a central server. In a final step, the local estimates are aggregated. In the setting of overparameterized linear regression, we prove general upper bounds with matching lower bounds and derive learning rates for specific data-generating distributions. We show that the excess risk is of the order of the variance provided the number of local nodes does not grow too large with the global sample size.

We further compare distributed SGD with distributed ridge regression and provide an upper bound on the excess SGD-risk in terms of the excess RR-risk for a certain range of the sample size.
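
A toy numpy sketch of the protocol (constant-stepsize local SGD followed by one-shot averaging at the server) is given below; dimensions, step size and noise level are arbitrary placeholders.

    import numpy as np

    def local_sgd(A, b, step=0.01, passes=5, seed=0):
        """Constant-stepsize SGD for least squares on one node's local data."""
        rng = np.random.default_rng(seed)
        w = np.zeros(A.shape[1])
        for _ in range(passes):
            for i in rng.permutation(A.shape[0]):
                w -= step * A[i] * (A[i] @ w - b[i])
        return w

    # Overparameterized toy setup: dimension d exceeds the per-node sample size.
    rng = np.random.default_rng(0)
    d, n_local, M = 50, 20, 10
    w_star = rng.normal(size=d)
    estimates = []
    for m in range(M):
        A = rng.normal(size=(n_local, d))
        b = A @ w_star + 0.1 * rng.normal(size=n_local)
        estimates.append(local_sgd(A, b, seed=m))
    w_avg = np.mean(estimates, axis=0)    # one-shot aggregation at the central server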

URL: https://openreview.net/forum?id=sfrcfYMOnZ

---

Title: On the Convergence and Calibration of Deep Learning with Differential Privacy

Abstract: Differentially private (DP) training preserves the data privacy usually at the cost of slower convergence (and thus lower accuracy), as well as more severe mis-calibration than its non-private counterpart. To analyze the convergence of DP training, we formulate a continuous time analysis through the lens of neural tangent kernel (NTK), which characterizes the per-sample gradient clipping and the noise addition in DP training, for arbitrary network architectures and loss functions. Interestingly, we show that the noise addition only affects the privacy risk but not the convergence or calibration, whereas the per-sample gradient clipping (under both flat and layerwise clipping styles) only affects the convergence and calibration.

Furthermore, we observe that DP models trained with a small clipping norm usually achieve the best accuracy, but are poorly calibrated and thus unreliable. In sharp contrast, DP models trained with a large clipping norm enjoy the same privacy guarantee and similar accuracy, but are significantly more \textit{calibrated}.
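
For reference, one step of DP training with flat per-sample clipping and Gaussian noise (the two ingredients analyzed above) looks roughly like the following sketch; the per-sample loop is written for clarity rather than efficiency, and the hyperparameters are placeholders.

    import torch

    def dp_sgd_step(model, loss_fn, xs, ys, lr=0.1, clip_norm=1.0, noise_mult=1.0):
        params = [p for p in model.parameters() if p.requires_grad]
        summed = [torch.zeros_like(p) for p in params]
        for x, y in zip(xs, ys):
            loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
            grads = torch.autograd.grad(loss, params)
            norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
            scale = torch.clamp(clip_norm / (norm + 1e-12), max=1.0)  # flat clipping
            for s, g in zip(summed, grads):
                s += scale * g
        with torch.no_grad():
            for p, s in zip(params, summed):
                noise = noise_mult * clip_norm * torch.randn_like(s)  # Gaussian noise
                p -= lr * (s + noise) / len(xs)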

URL: https://openreview.net/forum?id=K0CAGgjYS1

---

Title: Dynamic Subgoal-based Exploration via Bayesian Optimization

Abstract: Policy optimization in unknown, sparse-reward environments with expensive and limited interactions is challenging, and poses a need for effective exploration. Motivated by complex navigation tasks that require real-world training (when cheap simulators are not available), we consider an agent that faces an unknown distribution of environments and must decide on an exploration strategy, through a series of training environments, that can benefit policy learning in a test environment drawn from the environment distribution. Most existing approaches focus on fixed exploration strategies, while the few that view exploration as a meta-optimization problem tend to ignore the need for cost-efficient exploration. We propose a cost-aware Bayesian optimization (BO) approach that efficiently searches over a class of dynamic subgoal-based exploration strategies. The algorithm adjusts a variety of levers --- the locations of the subgoals, the length of each episode, and the number of replications per trial --- in order to overcome the challenges of sparse rewards, expensive interactions, and noise. Our experimental evaluation demonstrates that, when averaged across problem domains, the proposed algorithm outperforms the meta-learning algorithm MAML by 19%, the hyperparameter tuning method Hyperband by 23%, and BO techniques EI and LCB by 24% and 22%, respectively. We also provide a theoretical foundation and prove that the method asymptotically identifies a near-optimal subgoal design from the search space.

URL: https://openreview.net/forum?id=ThJl4d5JRg

---

Title: Designing Injective and Low-Entropic Transformer for Short-Long Range Encoding

Abstract: Multi-headed self-attention-based Transformers have shown promise in different learning tasks. Although these models exhibit significant improvement in understanding short-term and long-term contexts from sequences, the encoders of Transformers and their variants fail to preserve layer-wise contextual information. Transformers usually project tokens onto a sparse manifold and fail to preserve injectivity among the token representations. In this work, we propose TransJect, an encoder model that guarantees a theoretical bound for layer-wise distance preservation between a pair of tokens. We propose a simple alternative to dot-product attention to ensure Lipschitz continuity. This allows TransJect to learn injective mappings that transform token representations to different manifolds with similar topology and preserve the Euclidean distance between every pair of tokens in subsequent layers. Evaluations across multiple benchmark short- and long-sequence classification tasks show maximum improvements of $6.8\%$ and $5.9\%$, respectively, over variants of Transformers. TransJect achieves the best average accuracy on the long-range arena benchmark, showcasing its superiority in capturing temporal and spatial hierarchical relationships from long sequences. We further highlight the shortcomings of multi-headed self-attention from a statistical physics viewpoint. Although multi-headed self-attention was conceived to learn different abstraction levels within a network, our empirical analyses suggest that different attention heads learn in a random and disorderly fashion. In contrast, TransJect adopts a mixture of experts for regularization; these experts are found to be more orderly and balanced and to learn different sparse representations from the input sequences. TransJect exhibits very low entropy and can therefore be efficiently scaled to larger depths.


URL: https://openreview.net/forum?id=MOvh472UNH

---

Title: On Intriguing Layer-Wise Properties of Robust Overfitting in Adversarial Training

Abstract: Adversarial training has proven to be one of the most effective methods to defend against adversarial attacks. Nevertheless, robust overfitting is a common obstacle in adversarial training of deep networks. There is a common belief that the features learned by different network layers have different properties; however, existing works generally investigate robust overfitting by considering a DNN as a single unit, and hence the impact of different network layers on robust overfitting remains unclear. In this work, we divide a DNN into a series of layers and investigate the effect of different network layers on robust overfitting. We find that different layers exhibit distinct properties with respect to robust overfitting, and in particular, robust overfitting is mostly related to the optimization of the latter parts of the network. Based upon the observed effect, we propose a robust adversarial training (RAT) prototype: in a minibatch, we optimize the front parts of the network as usual, and adopt additional measures to regularize the optimization of the latter parts. Based on the prototype, we design two realizations of RAT, and extensive experiments demonstrate that RAT can eliminate robust overfitting and boost adversarial robustness over standard adversarial training.
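
As a purely illustrative sketch of the layer-wise idea (the specific regularizer below, a larger weight decay on the latter parameters, is a placeholder and not one of the paper's two realizations), the front and latter parts of the network can simply be placed in separate optimizer parameter groups:

    import torch

    def rat_optimizer(model, split_index, base_lr=0.1, extra_wd=5e-4):
        """Optimize the 'front' parameters as usual and apply extra
        regularization to the 'latter' parameters (placeholder: weight decay)."""
        params = list(model.parameters())
        front, latter = params[:split_index], params[split_index:]
        return torch.optim.SGD(
            [{"params": front, "weight_decay": 0.0},
             {"params": latter, "weight_decay": extra_wd}],
            lr=base_lr, momentum=0.9)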

URL: https://openreview.net/forum?id=BaoCnmosJz

---

Title: Differentially Private Diffusion Models Generate Useful Synthetic Images

Abstract: The ability to generate privacy-preserving synthetic versions of sensitive image datasets could unlock numerous ML applications currently constrained by data availability. Due to their astonishing image generation quality, diffusion models are a prime candidate for generating high-quality synthetic data. However, recent studies have found that, by default, the outputs of some diffusion models do not preserve training data privacy. By privately fine-tuning ImageNet pre-trained diffusion models with more than 80M parameters, we obtain SOTA results on CIFAR-10 and Camelyon17 in terms of both FID and the accuracy of downstream classifiers trained on synthetic data. We decrease the SOTA FID on CIFAR-10 from 26.8 to 9.8, and increase the accuracy from 51.0% to 88.0%. On synthetic data from Camelyon17, we achieve a downstream accuracy of 91.1% which is close to the SOTA of 96.5% when training on the real data. We leverage the ability of generative models to create infinite amounts of data to maximise the downstream prediction performance, and further show how to use synthetic data for hyperparameter tuning. Our results demonstrate that diffusion models fine-tuned with differential privacy can produce useful and provably private synthetic data, even in applications with significant distribution shift between the pretraining and fine-tuning distributions.

URL: https://openreview.net/forum?id=xohAWn3qXZ

---

Title: Two-Stage Neural Contextual Bandits for Adaptive Personalised Recommendations

Abstract: We consider the problem of personalised recommendations where each user consumes recommendations in a sequential fashion. Personalised recommendation methods that focus on exploiting user interests but ignore exploration will result in biased feedback loops, which hurt recommendation quality in the long term. In this paper, we consider contextual-bandit-based strategies to address the exploitation-exploration trade-off for large-scale adaptive personalised recommendation systems. In a large-scale system where the number of items is exponentially large, addressing the exploitation-exploration trade-off becomes significantly more challenging, rendering most existing standard contextual bandit algorithms inefficient. To systematically address this challenge, we propose a hierarchical neural contextual bandit framework to efficiently learn user preferences. Our hierarchical structure first explores dynamic topics before recommending a set of items. We leverage neural networks to learn non-linear representations of users and items, and use upper confidence bounds (UCBs) as the basis for item recommendation. We propose an additive linear and a bilinear structure for the UCB, where the former captures the representation uncertainties of users and items separately while the latter additionally captures the uncertainty of the user-item interaction. We show that our hierarchical framework with the proposed bandit policies exhibits strong computational and performance advantages compared to many standard bandit baselines on two large-scale standard recommendation benchmark datasets.
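
To make the additive UCB structure concrete, a simplified linear sketch (with assumed user/item representations and design matrices; the paper's neural version replaces these with learned features) is:

    import numpy as np

    def additive_ucb_scores(user_vec, item_vecs, A_user, A_item, alpha=1.0):
        """Exploit term: user-item dot product. Exploration terms: separate
        confidence widths for the user and item representations."""
        exploit = item_vecs @ user_vec
        user_bonus = np.sqrt(user_vec @ np.linalg.solve(A_user, user_vec))
        item_bonus = np.sqrt(np.einsum("nd,dk,nk->n", item_vecs,
                                       np.linalg.inv(A_item), item_vecs))
        return exploit + alpha * (user_bonus + item_bonus)

    # The recommended item is then the argmax over the candidate set:
    # a_t = np.argmax(additive_ucb_scores(u, V, A_u, A_v))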

URL: https://openreview.net/forum?id=6lDmsAHCNo

---

Title: Contextualize Me – The Case for Context in Reinforcement Learning

Abstract: While Reinforcement Learning (RL) has made great strides towards solving increasingly complicated problems, many algorithms are still brittle to even slight environmental changes. Contextual Reinforcement Learning (cRL) provides a framework to model such changes in a principled manner, thereby enabling flexible, precise and interpretable task specification and generation; cRL thus formalizes the study of generalization in RL. Our goal is to show how the framework of cRL can contribute to both our theoretical understanding of generalization and practical solutions for it. We show that theoretically optimal behavior in contextual Markov Decision Processes requires explicit context information. We empirically validate this result on various context-extended versions of common RL environments. These environments are part of CARL, the first benchmark library designed for generalization based on cRL extensions of popular benchmarks, which we propose as a testbed for the further study of general agents.

URL: https://openreview.net/forum?id=Y42xVBQusn

---

Title: Aux-Drop: Handling Haphazard Inputs in Online Learning Using Auxiliary Dropouts

Abstract: Many real-world applications based on online learning produce streaming data that is haphazard in nature, i.e., it contains missing features, features that become obsolete over time, new features that appear at later points in time, and a lack of clarity on the total number of input features. These challenges make it hard to build a learnable system for such applications, and almost no work exists in deep learning that addresses this issue. In this paper, we present Aux-Drop, an auxiliary dropout regularization strategy for online learning that handles haphazard input features in an effective manner. Aux-Drop adapts the conventional dropout regularization scheme to the haphazard input feature space, ensuring that the final output is minimally impacted by the chaotic appearance of such features. It helps prevent the co-adaptation of features, especially between the auxiliary and base features, and reduces the strong dependence of the output on any individual auxiliary input of the model. This enables better learning in scenarios where certain features disappear over time or new features are to be modeled. The efficacy of Aux-Drop has been demonstrated through extensive numerical experiments on SOTA benchmarking datasets, including Italy Power Demand, HIGGS, SUSY and multiple UCI datasets.
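
A simplified reading of the mechanism (dropout applied more aggressively to the auxiliary, haphazardly observed features before they are mixed with the base features) is sketched below; the layer shapes and dropout rate are assumptions.

    import torch
    import torch.nn as nn

    class AuxDropLayer(nn.Module):
        def __init__(self, n_base, n_aux, hidden, p_aux=0.5):
            super().__init__()
            self.aux_drop = nn.Dropout(p_aux)       # stronger dropout on auxiliary inputs
            self.fc = nn.Linear(n_base + n_aux, hidden)

        def forward(self, x_base, x_aux, aux_mask):
            # aux_mask is 1 where an auxiliary feature is currently observed,
            # 0 where it is missing or has become obsolete at this time step.
            x_aux = self.aux_drop(x_aux * aux_mask)
            return torch.relu(self.fc(torch.cat([x_base, x_aux], dim=-1)))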

URL: https://openreview.net/forum?id=R9CgBkeZ6Z

---

Title: LTD: Low Temperature Distillation for Gradient Masking-free Adversarial Training

Abstract: Adversarial training has been widely used to enhance the robustness of neural network models against adversarial attacks. However, there is still a notable gap between natural accuracy and robust accuracy. We find that one of the reasons is that the commonly used labels, one-hot vectors, hinder the learning process for image recognition. Representing an ambiguous image with a one-hot vector is imprecise, and the model may fall into a suboptimal solution. In this paper, we propose a method called Low Temperature Distillation (LTD), which is based on the knowledge distillation framework and generates the desired soft labels. Unlike previous work, LTD uses a relatively low temperature in the teacher model and employs different, but fixed, temperatures for the teacher and student models. This modification boosts robustness without defensive distillation. Moreover, we investigate methods to synergize the use of natural data and adversarial data in LTD. Experimental results show that, without extra unlabeled data, the proposed method combined with previous works achieves 58.19%, 31.13% and 42.08% robust accuracy on the CIFAR-10, CIFAR-100 and ImageNet datasets, respectively.
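
The soft-label construction at the heart of the method can be sketched as a standard distillation loss with a fixed, deliberately low teacher temperature; the temperatures and the KL form below are placeholders, not the paper's tuned recipe.

    import torch.nn.functional as F

    def ltd_soft_label_loss(student_logits, teacher_logits, t_teacher=2.0, t_student=1.0):
        # Soft targets from the teacher at a fixed, relatively low temperature.
        soft_targets = F.softmax(teacher_logits / t_teacher, dim=-1).detach()
        log_student = F.log_softmax(student_logits / t_student, dim=-1)
        return F.kl_div(log_student, soft_targets, reduction="batchmean")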

URL: https://openreview.net/forum?id=Cx64ppKQ5G

---

Title: Elementwise Language Representation

Abstract: We propose a new technique for computational language representation called elementwise embedding, in which a material (semantic unit) is abstracted into a horizontal concatenation of lower-dimensional element (character) embeddings. While elements are always characters, materials are arbitrary levels of semantic units, so the approach generalizes to any type of tokenization. To focus only on the important letters, the $n^{th}$ spellings of each semantic unit are aligned in the $n^{th}$ attention heads, then concatenated back into their original forms, creating unique embedding representations; they are jointly projected, thereby determining their own contextual importance. Technically, this framework is achieved by passing a sequence of materials, each consisting of $v$ elements, to a transformer having $h=v$ attention heads. As a pure embedding technique, elementwise embedding replaces the $w$-dimensional embedding table of a transformer model with $256$ $c$-dimensional elements (each corresponding to one of the UTF-8 bytes), where $c=w/v$. Using this novel approach, we show that the standard transformer architecture can be reused for all levels of language representation and can process much longer sequences at the same time complexity without "any" architectural modification or additional overhead. BERT trained with elementwise embedding outperforms its subword equivalent (the original implementation) in multilabel patent document classification, exhibiting superior robustness to domain-specificity and data imbalance, despite using $0.005\%$ of the embedding parameters. Experiments demonstrate the generalizability of the proposed method by successfully transferring these enhancements to the differently architected transformers CANINE and ALBERT.
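
A minimal sketch of the embedding layer described above (a 256 x c byte-embedding table whose outputs are concatenated back into a w-dimensional token vector, with w and v as placeholder values) could look like:

    import torch.nn as nn

    class ElementwiseEmbedding(nn.Module):
        def __init__(self, w=768, v=12):
            super().__init__()
            assert w % v == 0
            self.v, self.c = v, w // v
            self.byte_emb = nn.Embedding(256, self.c)   # one row per UTF-8 byte

        def forward(self, byte_ids):
            # byte_ids: (batch, seq_len, v) byte values in [0, 255], padded to v bytes.
            B, L, V = byte_ids.shape
            return self.byte_emb(byte_ids).reshape(B, L, V * self.c)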

URL: https://openreview.net/forum?id=J5RDV32Yu9

---

Title: LEAD: Min-Max Optimization from a Physical Perspective

Abstract: Adversarial formulations such as generative adversarial networks (GANs) have rekindled interest in two-player min-max games. A central obstacle in the optimization of such games is the rotational dynamics that hinder their convergence. In this paper, we show that game optimization shares dynamic properties with particle systems subject to multiple forces, and one can leverage tools from physics to improve optimization dynamics. Inspired by the physical framework, we propose LEAD, an optimizer for min-max games. Next, using Lyapunov stability theory and spectral analysis, we study LEAD’s convergence properties in continuous and discrete time settings for a class of quadratic min-max games to demonstrate linear convergence to the Nash equilibrium. Finally, we empirically evaluate our method on synthetic setups and CIFAR-10 image generation to demonstrate improvements in GAN training.

URL: https://openreview.net/forum?id=vXSsTYs6ZB

---
