Weekly TMLR digest for Oct 09, 2022

TMLR

Oct 8, 2022, 8:00:10 PM

to tmlr-annou...@googlegroups.com

Accepted papers
===============

Title: Differentiable Model Compression via Pseudo Quantization Noise

Authors: Alexandre Défossez, Yossi Adi, Gabriel Synnaeve

Abstract: We propose DiffQ, a differentiable method for model compression that quantizes model parameters without gradient approximations (e.g., the Straight Through Estimator). We suggest adding independent pseudo quantization noise to model parameters during training to approximate the effect of a quantization operator. DiffQ is differentiable both with respect to the unquantized weights and the number of bits used. Given a single hyper-parameter balancing the quantized model size against accuracy, DiffQ optimizes the number of bits used per individual weight or group of weights in end-to-end training. We experimentally verify that our method is competitive with STE-based quantization techniques on several benchmarks and architectures for image classification, language modeling, and audio source separation. For instance, on the ImageNet dataset, DiffQ compresses a 12-layer transformer-based model by more than a factor of 8 (fewer than 4 bits of precision per weight on average), with a loss of 0.3% in model accuracy. Code is available at github.com/facebookresearch/diffq

URL: https://openreview.net/forum?id=DijnKziche
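
As a rough illustration of the idea described in this abstract, the sketch below contrasts hard uniform quantization with additive pseudo quantization noise of the same scale. The function names, the min/max uniform grid, and the choice of uniform noise are assumptions made for illustration; they are not taken from the DiffQ code base, which should be consulted for the authors' actual method.

```python
import random

def quant_step(weights, bits):
    """Step size of a uniform grid with 2**bits levels spanning the weight range."""
    lo, hi = min(weights), max(weights)
    return (hi - lo) / (2 ** bits - 1)

def quantize(weights, bits):
    """Hard quantization: round each weight to the nearest grid point."""
    lo = min(weights)
    step = quant_step(weights, bits)
    return [lo + round((w - lo) / step) * step for w in weights]

def pseudo_quant_noise(weights, bits, rng):
    """Training-time stand-in (assumed form): add independent noise whose
    scale matches the rounding error of the grid, instead of rounding."""
    step = quant_step(weights, bits)
    return [w + step * rng.uniform(-0.5, 0.5) for w in weights]

rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(8)]  # toy weights
bits = 4
step = quant_step(weights, bits)
hard = quantize(weights, bits)
noisy = pseudo_quant_noise(weights, bits, rng)
```

In this sketch the noisy weights depend on `bits` only through `step`, which is a smooth function of `bits`. That is one way to see why a noise-based surrogate can be differentiated with respect to the bit width, whereas the rounding in `quantize` cannot.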

---

Title: Local Kernel Ridge Regression for Scalable, Interpolating, Continuous Regression

Authors: Mingxuan Han, Chenglong Ye, Jeff Phillips

Abstract: We study a localized version of kernel ridge regression that can continuously and smoothly interpolate underlying function values that are highly non-linear in the observed data points. The new method can handle data whose (a) local density is highly uneven and (b) function values change dramatically in certain small but unknown regions. By introducing a new rank-based interpolation scheme, the interpolated values provided by our local method vary continuously with the query points. Unlike traditional kernel ridge regression, our method scales by avoiding the full matrix inverse.

URL: https://openreview.net/forum?id=EDAk6F8yMM
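
For readers unfamiliar with the baseline, here is a minimal sketch of plain (global) kernel ridge regression in one dimension. It solves the full linear system, which is the step the abstract says the local method avoids. It is not the authors' local method. The RBF kernel, bandwidth, ridge value, and toy data are all assumptions chosen for illustration.

```python
import math

def rbf(x, z, bandwidth):
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-((x - z) ** 2) / (2.0 * bandwidth ** 2))

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def krr_fit(xs, ys, bandwidth, ridge):
    """Dual coefficients alpha solving (K + ridge * I) alpha = y."""
    k = [[rbf(xi, xj, bandwidth) + (ridge if i == j else 0.0)
          for j, xj in enumerate(xs)] for i, xi in enumerate(xs)]
    return solve(k, ys)

def krr_predict(xs, alpha, bandwidth, query):
    """Prediction at a query point: sum_i alpha_i * k(query, x_i)."""
    return sum(a * rbf(query, x, bandwidth) for a, x in zip(alpha, xs))

xs = [0.0, 1.0, 2.0, 3.0, 4.0]      # toy inputs
ys = [0.0, 0.8, 0.9, 0.1, -0.8]     # toy targets
alpha = krr_fit(xs, ys, bandwidth=0.5, ridge=1e-8)
fitted = [krr_predict(xs, alpha, 0.5, x) for x in xs]
```

With a very small ridge term the fit nearly interpolates the training targets. The cubic cost of `solve` in the number of points is what motivates localized variants.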

---

Title: Mace: A flexible framework for membership privacy estimation in generative models

Authors: Yixi Xu, Sumit Mukherjee, Xiyang Liu, Shruti Tople, Rahul M Dodhia, Juan M Lavista Ferres

Abstract: Generative machine learning models are being increasingly viewed as a way to share sensitive data between institutions. While there has been work on developing differentially private generative modeling approaches, these approaches generally lead to sub-par sample quality, limiting their use in real world applications. Another line of work has focused on developing generative models which lead to higher quality samples but currently lack any formal privacy guarantees. In this work, we propose the first formal framework for membership privacy estimation in generative models. We formulate the membership privacy risk as a statistical divergence between training samples and hold-out samples, and propose sample-based methods to estimate this divergence. Compared to previous works, our framework makes more realistic and flexible assumptions. First, we offer a generalizable metric as an alternative to the accuracy metric (Yeom et al., 2018; Hayes et al., 2019) especially for imbalanced datasets. Second, we loosen the assumption of having full access to the underlying distribution from previous studies (Yeom et al., 2018; Jayaraman et al., 2020), and propose sample-based estimations with theoretical guarantees. Third, along with the population-level membership privacy risk estimation via the optimal membership advantage, we offer the individual-level estimation via the individual privacy risk. Fourth, our framework allows adversaries to access the trained model via a customized query, while prior works require specific attributes (Hayes et al., 2019; Chen et al., 2019; Hilprecht et al., 2019).

URL: https://openreview.net/forum?id=Zxm0kNe3u7

---

Title: Fingerprints of Super Resolution Networks

Authors: Jeremy Vonderfecht, Feng Liu

Abstract: Several recent studies have demonstrated that deep-learning based image generation models, such as GANs, can be uniquely identified, and possibly even reverse-engineered, by the fingerprints they leave on their output images. We extend this research to single image super-resolution (SISR) networks. Compared to previously studied models, SISR networks are a uniquely challenging class of image generation model from which to extract and analyze fingerprints, as they can often generate images that closely match the corresponding ground truth and thus likely leave little flexibility to embed signatures. We take SISR models as examples to investigate if the findings from the previous work on fingerprints of GAN-based networks are valid for general image generation models. We show that SISR networks with a high upscaling factor or trained using adversarial loss leave highly distinctive fingerprints, and that under certain conditions, some SISR network hyperparameters can be reverse-engineered from these fingerprints.

URL: https://openreview.net/forum?id=Jj0qSbtwdb

---

Title: Online Double Oracle

Authors: Le Cong Dinh, Stephen Marcus McAleer, Zheng Tian, Nicolas Perez-Nieves, Oliver Slumbers, David Henry Mguni, Jun Wang, Haitham Bou Ammar, Yaodong Yang

Abstract: Solving strategic games with huge action spaces is a critical yet under-explored topic in economics, operations research and artificial intelligence. This paper proposes new learning algorithms for solving two-player zero-sum normal-form games where the number of pure strategies is prohibitively large. Specifically, we combine no-regret analysis from online learning with Double Oracle (DO) from game theory.
Our method, Online Double Oracle (ODO), is provably convergent to a Nash equilibrium (NE). Most importantly, unlike normal DO, ODO is rational in the sense that each agent in ODO can exploit a strategic adversary with a regret bound of $\mathcal{O}(\sqrt{k \log(k)/T})$, where $k$ is not the total number of pure strategies but rather the size of the effective strategy set. In many applications, we empirically show that $k$ is linearly dependent on the support size of the NE. On tens of different real-world matrix games, ODO outperforms DO, PSRO, and no-regret algorithms such as Multiplicative Weights Update by a significant margin, both in terms of convergence rate to an NE and average payoff against strategic adversaries.

URL: https://openreview.net/forum?id=rrMK6hYNSx
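
The abstract names Multiplicative Weights Update (MWU) as one of the no-regret baselines. The sketch below shows plain MWU for the row player of a small zero-sum matrix game against a best-responding column player. It is not ODO itself. The function name, the learning-rate choice, and the matching-pennies example are assumptions for illustration.

```python
import math

def mwu_vs_best_response(payoff, rounds):
    """Row player runs MWU; column player best-responds each round.

    payoff[i][j] is the row player's gain, assumed to lie in [-1, 1].
    Returns the row player's time-averaged mixed strategy.
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    eta = math.sqrt(2.0 * math.log(n_rows) / rounds)  # a common learning-rate choice
    weights = [1.0] * n_rows
    avg = [0.0] * n_rows
    for _ in range(rounds):
        total = sum(weights)
        x = [w / total for w in weights]  # current mixed strategy
        # Column player picks the pure strategy that minimises the row player's expected gain.
        col = min(range(n_cols),
                  key=lambda j: sum(x[i] * payoff[i][j] for i in range(n_rows)))
        # Exponential reweighting by the gain each row would have obtained.
        weights = [w * math.exp(eta * payoff[i][col]) for i, w in enumerate(weights)]
        avg = [a + xi / rounds for a, xi in zip(avg, x)]
    return avg

matching_pennies = [[1.0, -1.0], [-1.0, 1.0]]
avg_strategy = mwu_vs_best_response(matching_pennies, rounds=2000)
```

In matching pennies the unique equilibrium mixes both actions equally, so the averaged strategy should end up close to (0.5, 0.5). Note that MWU maintains a weight for every pure strategy, which is the cost the abstract says becomes prohibitive in games with huge action spaces.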

---

Title: Attribute Prediction as Multiple Instance Learning

Authors: Diego Marcos, Aike Potze, Wenjia Xu, Devis Tuia, Zeynep Akata

Abstract: Attribute-based representations help machine learning models perform tasks based on human understandable concepts, allowing a closer human-machine collaboration. However, learning attributes that accurately reflect the content of an image is not always straightforward, as per-image ground truth attributes are often not available.
We propose applying the Multiple Instance Learning (MIL) paradigm to attribute learning (AMIL) while only using class-level labels.
We allow the model to under-predict the positive attributes, which may be missing in a particular image due to occlusions or unfavorable pose, but not to over-predict the negative ones, which are almost certainly not present. We evaluate it in the zero-shot learning (ZSL) setting, where training and test classes are disjoint, and show that this also makes it possible to profit from knowledge about the semantic relatedness of attributes.
In addition, we apply the MIL assumption to ZSL classification and propose MIL-DAP, an attribute-based zero-shot classification method, based on Direct Attribute Prediction (DAP), to evaluate attribute prediction methods when no image-level data is available for evaluation.
Experiments on CUB-200-2011, SUN Attributes and AwA2 show improvements on attribute detection, attribute-based zero-shot classification and weakly supervised part localization.

URL: https://openreview.net/forum?id=nmFczdJtc2

---

Title: Completeness and Coherence Learning for Fast Arbitrary Style Transfer

Authors: Zhijie Wu, Chunjin Song, Guanxiong Chen, Sheng Guo, Weilin Huang

Abstract: Style transfer methods put a premium on two objectives: (1) completeness which encourages the encoding of a complete set of style patterns; (2) coherence which discourages the production of spurious artifacts not found in input styles. While existing methods pursue the two objectives either partially or implicitly, we present the Completeness and Coherence Network (CCNet) which jointly learns completeness and coherence components and rejects their incompatibility, both in an explicit manner. Specifically, we develop an attention mechanism integrated with bi-directional softmax operations for explicit imposition of the two objectives and for their collaborative modelling. We also propose CCLoss as a quantitative measure for evaluating the quality of a stylized image in terms of completeness and coherence. Through an empirical evaluation, we demonstrate that compared with existing methods, our method strikes a better tradeoff between computation costs, generalization ability and stylization quality.

URL: https://openreview.net/forum?id=4N6T6Rop6k

---

Title: sigmoidF1: A Smooth F1 Score Surrogate Loss for Multilabel Classification

Authors: Gabriel Bénédict, Hendrik Vincent Koops, Daan Odijk, Maarten de Rijke

Abstract: Multilabel classification is the task of attributing multiple labels to examples via predictions. Current models formulate a reduction of the multilabel setting into either multiple binary classifications or multiclass classification, allowing for the use of existing loss functions (sigmoid, cross-entropy, logistic, etc.). These multilabel classification reductions do not accommodate the prediction of varying numbers of labels per example. Moreover, the loss functions are distant estimates of the performance metrics. We propose sigmoidF1, a loss function that is an approximation of the F1 score that (i) is smooth and tractable for stochastic gradient descent, (ii) naturally approximates a multilabel metric, and (iii) estimates both label suitability and label counts. We show that any confusion matrix metric can be formulated with a smooth surrogate. We evaluate the proposed loss function on text and image datasets, and with a variety of metrics, to account for the complexity of multilabel classification evaluation. sigmoidF1 outperforms other loss functions on one text and two image datasets over several metrics. These results show the effectiveness of using inference-time metrics as loss functions for non-trivial classification problems like multilabel classification.

URL: https://openreview.net/forum?id=gvSHaaD2wQ
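
To make the idea of a smooth F1 surrogate concrete, the sketch below builds soft confusion-matrix counts from sigmoid outputs and returns one minus the resulting soft F1. The exact parameterisation (the slope `beta` and offset `eta`), the small constant in the denominator, and the function names are assumptions for illustration and may differ from the paper's definition.

```python
import math

def sigmoid(u, beta=1.0, eta=0.0):
    """Tunable sigmoid; beta (slope) and eta (offset) are assumed hyper-parameters."""
    return 1.0 / (1.0 + math.exp(-beta * (u + eta)))

def smooth_f1_loss(logits, labels, beta=1.0, eta=0.0):
    """1 - soft F1 over a batch.

    logits, labels: lists of per-example lists; labels are 0 or 1.
    Soft counts replace thresholded predictions with sigmoid outputs,
    so the loss is smooth in the logits.
    """
    tp = fp = fn = 0.0
    for row_logits, row_labels in zip(logits, labels):
        for u, y in zip(row_logits, row_labels):
            s = sigmoid(u, beta, eta)
            tp += s * y              # soft true positives
            fp += s * (1 - y)        # soft false positives
            fn += (1 - s) * y        # soft false negatives
    return 1.0 - 2.0 * tp / (2.0 * tp + fp + fn + 1e-16)

labels = [[1, 0, 1], [0, 1, 0]]
confident_right = [[10.0, -10.0, 10.0], [-10.0, 10.0, -10.0]]
confident_wrong = [[-10.0, 10.0, -10.0], [10.0, -10.0, 10.0]]
loss_right = smooth_f1_loss(confident_right, labels)
loss_wrong = smooth_f1_loss(confident_wrong, labels)
```

Confidently correct logits drive the loss toward 0 and confidently wrong logits drive it toward 1, which mirrors how the thresholded F1 score behaves at the extremes.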

---

Title: FLEA: Provably Robust Fair Multisource Learning from Unreliable Training Data

Authors: Eugenia Iofinova, Nikola Konstantinov, Christoph H Lampert

Abstract: Fairness-aware learning aims at constructing classifiers that not only make accurate predictions, but also do not discriminate against specific groups. It is a fast-growing area of machine learning with far-reaching societal impact. However, existing fair learning methods are vulnerable to accidental or malicious artifacts in the training data, which can cause them to unknowingly produce unfair classifiers. In this work we address the problem of fair learning from unreliable training data in the robust multisource setting, where the available training data comes from multiple sources, a fraction of which might not be representative of the true data distribution. We introduce FLEA, a filtering-based algorithm that identifies and suppresses those data sources that would have a negative impact on fairness or accuracy if they were used for training. As such, FLEA is not a replacement for prior fairness-aware learning methods but rather an augmentation that makes any of them robust against unreliable training data. We show the effectiveness of our approach through a diverse range of experiments on multiple datasets. Additionally, we prove formally that, given enough data, FLEA protects the learner against corruptions as long as the fraction of affected data sources is less than half. Our source code and documentation are available at https://github.com/ISTAustria-CVML/FLEA.

URL: https://openreview.net/forum?id=XsPopigZXV

---

Title: GemNet-OC: Developing Graph Neural Networks for Large and Diverse Molecular Simulation Datasets

Authors: Johannes Gasteiger, Muhammed Shuaibi, Anuroop Sriram, Stephan Günnemann, Zachary Ward Ulissi, C. Lawrence Zitnick, Abhishek Das

Abstract: Recent years have seen the advent of molecular simulation datasets that are orders of magnitude larger and more diverse. These new datasets differ substantially in four aspects of complexity: 1. chemical diversity (number of different elements), 2. system size (number of atoms per sample), 3. dataset size (number of data samples), and 4. domain shift (similarity of the training and test set). Despite these large differences, benchmarks on small and narrow datasets remain the predominant method of demonstrating progress in graph neural networks (GNNs) for molecular simulation, likely due to cheaper training compute requirements. This raises the question: does GNN progress on small and narrow datasets translate to these more complex datasets? This work investigates this question by first developing the GemNet-OC model based on the large Open Catalyst 2020 (OC20) dataset. GemNet-OC outperforms the previous state-of-the-art on OC20 by 16% while reducing training time by a factor of 10. We then compare the impact of 18 model components and hyperparameter choices on performance in multiple datasets. We find that the resulting model would be drastically different depending on the dataset used for making model choices. To isolate the source of this discrepancy we study six subsets of the OC20 dataset that individually test each of the above-mentioned four dataset aspects. We find that results on the OC-2M subset correlate well with the full OC20 dataset while being substantially cheaper to train on. Our findings challenge the common practice of developing GNNs solely on small datasets, but highlight ways of achieving fast development cycles and generalizable results via moderately-sized, representative datasets such as OC-2M and efficient models such as GemNet-OC. Our code and pretrained model weights are open-sourced.

URL: https://openreview.net/forum?id=u8tvSxm4Bs

---

New submissions
===============

Title: Auxiliary Cross-Modal Representation Learning with Triplet Loss Functions for Online Handwriting Recognition

Abstract: Cross-modal representation learning learns a shared embedding between two or more modalities to improve performance in a given task compared to using only one of the modalities. Cross-modal representation learning from different data types -- such as images and time-series data (e.g., audio or text data) -- requires a deep metric learning loss that minimizes the distance between the modality embeddings. In this paper, we propose to use the triplet loss, which uses positive and negative identities to create sample pairs with different labels, for cross-modal representation learning between image and time-series modalities (CMR-IS). By adapting the triplet loss for cross-modal representation learning, higher accuracy in the main (time-series classification) task can be achieved by exploiting additional information of the auxiliary (image classification) task. Our experiments on synthetic data and handwriting recognition data from sensor-enhanced pens show improved classification accuracy, faster convergence, and better generalizability.

URL: https://openreview.net/forum?id=8isD3lOdIl
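
The triplet loss mentioned in this abstract can be sketched in a few lines. The example below uses the standard margin-based form with Euclidean distance. The two-dimensional embeddings, the variable names, and the pairing of a time-series anchor with image embeddings are hypothetical and only meant to illustrate the cross-modal setting, not the paper's adapted loss.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: pull the positive closer to the anchor than
    the negative by at least `margin`, otherwise incur a penalty."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Hypothetical embeddings: the anchor comes from a time-series encoder,
# the positive and negative from an image encoder.
ts_anchor = [0.0, 0.0]
img_same_class = [0.1, 0.0]
img_other_class = [3.0, 4.0]

easy = triplet_loss(ts_anchor, img_same_class, img_other_class)
hard = triplet_loss(ts_anchor, img_other_class, img_same_class)
```

The first triplet already satisfies the margin and contributes no loss. The second, with positive and negative swapped, is penalised by the full distance gap plus the margin.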

---

Title: Hidden Heterogeneity: When to Choose Similarity-Based Calibration

Abstract: Trustworthy classifiers are essential to the adoption of machine learning predictions in many real-world settings. The predicted probability of possible outcomes can inform high-stakes decision making, particularly when assessing the expected value of alternative decisions or the risk of bad outcomes. These decisions require well-calibrated probabilities, not just the correct prediction of the most likely class. Black-box classifier calibration methods can improve the reliability of a classifier’s output without requiring retraining. However, these methods are unable to detect subpopulations where calibration could also improve prediction accuracy. Such subpopulations are said to exhibit “hidden heterogeneity” (HH), because the original classifier did not detect them. The paper proposes a quantitative measure for HH. It also introduces two similarity-weighted calibration methods that can address HH by adapting locally to each test item: SWC weights the calibration set by similarity to the test item, and SWC-HH explicitly incorporates hidden heterogeneity to filter the calibration set. Experiments show that the improvements in calibration achieved by similarity-based calibration methods correlate with the amount of HH present and, given sufficient calibration data, generally exceed calibration achieved by global methods. HH can therefore serve as a useful diagnostic tool for identifying when local calibration methods would be beneficial.

URL: https://openreview.net/forum?id=RA0TDqt3hC

---

Title: COIN++: Neural Compression Across Modalities

Abstract: Neural compression algorithms are typically based on autoencoders that require specialized encoder and decoder architectures for different data modalities. In this paper, we propose COIN++, a neural compression framework that seamlessly handles a wide range of data modalities. Our approach is based on converting data to implicit neural representations, i.e. neural functions that map coordinates (such as pixel locations) to features (such as RGB values). Then, instead of storing the weights of the implicit neural representation directly, we store modulations applied to a meta-learned base network as a compressed code for the data. We further quantize and entropy code these modulations, leading to large compression gains while reducing encoding time by two orders of magnitude compared to baselines. We empirically demonstrate the effectiveness of our method by compressing various data modalities, from images and audio to medical and climate data.

URL: https://openreview.net/forum?id=NXB0rEM2Tq

---

Title: Signed Graph Neural Networks: A Frequency Perspective

Abstract: Graph convolutional networks (GCNs) and their variants are designed for unsigned graphs containing only positive links. Many existing GCNs have been derived from the spectral domain analysis of signals lying over (unsigned) graphs, and in each convolution layer they perform low-pass filtering of the input features followed by a learnable linear transformation. Their extension to signed graphs with positive as well as negative links poses multiple issues, including computational irregularities and ambiguous frequency interpretation, making the design of computationally efficient low-pass filters challenging. In this paper, we address these issues via spectral analysis of signed graphs and propose two different signed graph neural networks: one that keeps only low-frequency information and one that also retains high-frequency information. We further introduce the magnetic signed Laplacian and use its eigendecomposition for spectral analysis of directed signed graphs. We test our methods on node classification and link sign prediction tasks on signed graphs and achieve state-of-the-art performance.

URL: https://openreview.net/forum?id=RZveYHgZbu

---

Title: Towards Open-world Representation Learning with Wild Data

Abstract: Recent advances in representation learning have shown remarkable performance. However, the vast majority of approaches are limited to the closed-world setting. In this paper, we enrich the landscape of representation learning by tapping into an open-world setting, where unlabeled samples from novel classes can naturally emerge in the wild. To bridge the gap, we introduce a new learning framework, open-world contrastive learning (OpenCon). OpenCon tackles the challenges of learning compact representations for both known and novel classes, and facilitates novelty discovery along the way. We demonstrate the effectiveness of OpenCon on challenging benchmark datasets and establish competitive performance. On the ImageNet dataset, OpenCon significantly outperforms the current best method by 11.9% and 7.4% on novel and overall classification accuracy, respectively. We hope that our work will open up new doors for future work to tackle this important and realistic problem.

URL: https://openreview.net/forum?id=2wWJxtpFer

---

Title: Multiscale Non-stationary Causal Structure Learning from Time Series Data

Abstract: This paper introduces a new type of causal structure, namely multiscale non-stationary directed acyclic graph (MN-DAG), that generalizes DAGs to the time-frequency domain.
Our contribution is twofold.
First, by leveraging results from spectral and causality theories, we present a novel probabilistic generative model that allows sampling an MN-DAG according to user-specified priors concerning the time-dependence and multiscale properties of the causal graph.
Second, we devise a Bayesian method for the estimation of MN-DAGs, by means of stochastic variational inference (SVI), called Multiscale Non-Stationary Causal Structure Learner (MN-CASTLE).
In addition to direct observations, MN-CASTLE exploits information from the decomposition of the total power spectrum of time series over different time resolutions.
In our experiments, we first use the proposed model to generate synthetic data according to a latent MN-DAG, showing that the data generated reproduces well-known features of time series in different domains. Then we compare our learning method MN-CASTLE against baseline models on synthetic data generated with different multiscale and non-stationary settings, confirming the good performance of MN-CASTLE.
Finally, we show some insights derived from the application of MN-CASTLE to study the causal structure of 7 global equity markets during the Covid-19 pandemic.

URL: https://openreview.net/forum?id=jlE1439nwL

---

Title: Communication-Efficient Distributionally Robust Decentralized Learning

Abstract: Decentralized learning algorithms empower interconnected devices to share data and computational resources to collaboratively train a machine learning model without the aid of a central coordinator. In the case of heterogeneous data distributions at the network nodes, collaboration can yield predictors with unsatisfactory performance for a subset of the devices. For this reason, in this work we consider the formulation of a distributionally robust decentralized learning task and we propose a decentralized single-loop gradient descent/ascent algorithm (AD-GDA) to directly solve the underlying minimax optimization problem. We render our algorithm communication-efficient by employing a compressed consensus scheme and we provide convergence guarantees for smooth convex and non-convex loss functions. Finally, we corroborate the theoretical findings with empirical results that highlight AD-GDA's ability to provide unbiased predictors and to greatly improve communication efficiency compared to existing distributionally robust algorithms.

URL: https://openreview.net/forum?id=tnRRHzZPMq

---

Title: Online Unsupervised Learning of Visual Representations and Categories

Abstract: Real world learning scenarios involve a nonstationary distribution of classes with sequential dependencies among the samples, in contrast to the standard machine learning formulation of drawing samples independently from a fixed, typically uniform distribution. Furthermore, real world interactions demand learning on-the-fly from few or no class labels. In this work, we propose an unsupervised model that simultaneously performs online visual representation learning and few-shot learning of new categories without relying on any class labels. Our model is a prototype-based memory network with a control component that determines when to form a new class prototype. We formulate it as an online mixture model, where components are created with only a single new example, and assignments do not have to be balanced, which permits an approximation to natural imbalanced distributions from uncurated raw data. Learning includes a contrastive loss that encourages different views of the same image to be assigned to the same prototype. The result is a mechanism that forms categorical representations of objects in nonstationary environments. Experiments show that our method can learn from an online stream of visual input data and its learned representations are significantly better at category recognition compared to state-of-the-art self-supervised learning methods.

URL: https://openreview.net/forum?id=afoFhWOdra

---

Title: Unsupervised Network Embedding Beyond Homophily

Abstract: Network embedding (NE) approaches have emerged as a predominant technique to represent complex networks and have benefited numerous tasks. However, most NE approaches rely on a homophily assumption to learn embeddings with the guidance of supervisory signals, leaving the unsupervised heterophilous scenario relatively unexplored. This problem becomes especially relevant in fields where labels are scarce. Here, we formulate the unsupervised NE task as an $r$-ego network discrimination problem and develop the SELENE framework for learning on networks with homophily and heterophily. Specifically, we design a dual-channel feature embedding pipeline to discriminate $r$-ego networks using node attributes and structural information separately. We employ negative-sample-free self-supervised learning objective functions to optimise the framework to learn intrinsic node embeddings. We show that SELENE's components improve the quality of node embeddings, facilitating the discrimination of connected heterophilous nodes. Comprehensive empirical evaluations on both synthetic and real-world datasets with varying homophily ratios validate the effectiveness of SELENE in homophilous and heterophilous settings, showing a clustering accuracy gain of up to 13.35%.

URL: https://openreview.net/forum?id=sRgvmXjrmg

---

Title: On the infinite-depth limit of finite-width neural networks

Abstract: In this paper, we study the infinite-depth limit of finite-width residual neural networks with random Gaussian weights. With proper scaling, we show that by fixing the width and taking the depth to infinity, the pre-activations converge in distribution to a zero-drift diffusion process. Unlike the infinite-width limit, where the pre-activations converge weakly to a Gaussian random variable, we show that the infinite-depth limit yields different distributions depending on the choice of the activation function. We document two cases where these distributions have closed-form (different) expressions. We further show an intriguing phase-transition phenomenon of the post-activation norms when the width increases from 3 to 4. Lastly, we study the sequential limit infinite-depth-then-infinite-width, and show some key differences with the more commonly studied infinite-width-then-infinite-depth limit.

URL: https://openreview.net/forum?id=RbLsYz1Az9

---

Title: Inducing Early Neural Collapse in Deep Neural Networks for Improved Out-of-Distribution Detection

Abstract: We propose a simple modification to standard ResNet architectures, L2 regularization over feature space, that substantially improves out-of-distribution (OoD) performance on the previously proposed Deep Deterministic Uncertainty (DDU) benchmark. We show that this change also induces early Neural Collapse (NC), an effect under which better OoD performance is more probable. Our method achieves comparable or superior OoD detection scores and classification accuracy in a small fraction of the training time of the benchmark. Additionally, it substantially improves worst case OoD performance over multiple, randomly initialized models. Though we do not suggest that NC is the sole mechanism or a comprehensive explanation for OoD behaviour in deep neural networks (DNN), we believe NC's simple mathematical and geometric structure can provide a framework for analysis of this complex phenomenon in future work.

URL: https://openreview.net/forum?id=fjkN5Ur2d6

---

Title: Are Vision Transformers Robust to Spurious Correlations?

Abstract: Deep neural networks may be susceptible to learning spurious correlations that hold on average but not in atypical test samples. With the recent emergence of vision transformer (ViT) models, it remains unexplored how spurious correlations manifest in such architectures. In this paper, we systematically investigate the robustness of different transformer architectures to spurious correlations on three challenging benchmark datasets and compare their performance with popular CNNs. Our study reveals that when pre-trained on a sufficiently large dataset, ViT models are more robust to spurious correlations than CNNs. Key to their success is the ability to generalize better from the examples where spurious correlations do not hold. Further, we perform extensive ablations and experiments to understand the role of the self-attention mechanism in providing robustness under spuriously correlated environments. We hope that our work will inspire future research on further understanding the robustness of ViT models to spurious correlations.

URL: https://openreview.net/forum?id=0JIzy1TZe5

---

Title: Supervised Feature Selection with Neuron Evolution in Sparse Neural Networks

Abstract: Feature selection that selects an informative subset of variables from data not only enhances the model interpretability and performance but also alleviates the resource demands. In recent years, research into feature selection in neural networks, which are computationally demanding and black-box models, has become popular. However, existing methods usually suffer from high computational costs when applied to high-dimensional datasets. In this paper, inspired by evolution processes, we propose a novel resource-efficient supervised feature selection method using sparse neural networks, named "NeuroFS". By gradually pruning the unimportant features from the input layer of a sparse neural network trained from scratch, NeuroFS derives an informative subset of features efficiently. By performing several experiments on 11 low and high-dimensional real-world benchmarks of different types, we demonstrate that NeuroFS achieves the highest ranking-based score among the considered state-of-the-art supervised feature selection models. We will make the code publicly available on GitHub after acceptance of the paper.

URL: https://openreview.net/forum?id=GcO6ugrLKp

---

Title: AI-SARAH: Adaptive and Implicit Stochastic Recursive Gradient Methods

Abstract: We present AI-SARAH, a practical variant of SARAH. As a variant of SARAH, this algorithm employs the stochastic recursive gradient yet adjusts step-size based on local geometry. AI-SARAH implicitly computes step-size and efficiently estimates local Lipschitz smoothness of stochastic functions. It is fully adaptive, tune-free, straightforward to implement, and computationally efficient. We provide technical insight and intuitive illustrations on its design and convergence. We conduct extensive empirical analysis and demonstrate its strong performance compared with its classical counterparts and other state-of-the-art first-order methods in solving convex machine learning problems.

URL: https://openreview.net/forum?id=WoXJFsJ6Zw

---

Title: Lazy vs hasty: linearization in deep networks impacts learning schedule based on example difficulty

Abstract: Among attempts at giving a theoretical account of the success of deep neural networks, a recent line of work has identified a so-called `lazy' training regime in which the network can be well approximated by its linearization around initialization. Here we investigate the comparative effect of the lazy (linear) and feature learning (non-linear) regimes on subgroups of examples based on their difficulty. Specifically, we show that easier examples are given more weight in feature learning mode, resulting in faster training compared to more difficult ones. In other words, the non-linear dynamics tends to sequentialize the learning of examples of increasing difficulty. We illustrate this phenomenon across different ways to quantify example difficulty, including c-score, label noise, and in the presence of easy-to-learn spurious correlations. Our results reveal a new understanding of how deep networks prioritize resources across example difficulty.
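For reference, the linearization that defines the lazy regime is usually written as a first-order Taylor expansion of the network output in its parameters. The symbols $f$, $\theta$ and $\theta_0$ (network, current parameters, parameters at initialization) are notation assumed here, not taken from the abstract.

```latex
f_{\mathrm{lin}}(x;\theta) = f(x;\theta_0) + \nabla_\theta f(x;\theta_0)^{\top} (\theta - \theta_0)
```

In the lazy regime training behaves as if the model were $f_{\mathrm{lin}}$; the feature learning regime is the one where this approximation breaks down.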

URL: https://openreview.net/forum?id=lukVf4VrfP

---

Title: Controlling Confusion via Generalisation Bounds

Abstract: We establish new generalisation bounds for multiclass classification by abstracting to a more general setting of discretised error types. Extending the PAC-Bayes theory, we are hence able to provide fine-grained bounds on performance for multiclass classification, as well as applications to other learning problems including discretisation of regression losses. Tractable training objectives are derived from the bounds. The bounds are uniform over all weightings of the discretised error types and thus can be used to bound weightings not foreseen at training, including the full confusion matrix in the multiclass classification case.

URL: https://openreview.net/forum?id=iIox1e72OK

---

Title: On Characterizing the Trade-off in Invariant Representation Learning

Abstract: Many applications of representation learning, such as privacy-preservation, algorithmic fairness, and domain adaptation, desire explicit control over semantic information being discarded. This goal is formulated as satisfying two objectives: maximizing utility for predicting a target attribute while simultaneously being invariant (independent) to a known semantic attribute. Solutions to invariant representation learning (IRL) problems lead to a trade-off between utility and invariance when they are competing. While existing works study bounds on this trade-off, two questions remain outstanding: 1) What is the exact trade-off between utility and invariance? and 2) What are the encoders (mapping the data to a representation) that achieve the trade-off, and how can we estimate it from training data? This paper addresses these questions for IRLs in reproducing kernel Hilbert spaces (RKHSs). We derive a closed-form solution for the global optima of the underlying optimization problem for encoders in RKHSs. This in turn yields closed formulae for the exact trade-off, optimal representation dimensionality, and the corresponding encoder(s). We also numerically quantify the trade-off on representative problems and compare them to those achieved by baseline IRL algorithms.

URL: https://openreview.net/forum?id=3gfpBR1ncr

---

Title: Diverse Counterfactual Explanations for Anomaly Detection in Time Series

Abstract: Data-driven algorithms for detecting anomalies in time series data are ubiquitous, but generally unable to provide helpful explanations for the predictions they make. In this work we propose a post-hoc explainability method that is applicable to any differentiable anomaly detection algorithm for time series. Our method provides explanations in the form of a set of diverse counterfactual examples, i.e., multiple perturbed versions of the original time series that are similar to the latter but not considered anomalous by the detection algorithm. Those examples are informative on the important features of the time series and the magnitude of changes that can be made to render it non-anomalous for the explained algorithm. We call our method counterfactual ensemble explanation, and test it on two deep-learning-based anomaly detection models. We apply the latter to univariate and multivariate real-world data sets and assess the quality of our explanations under several explainability criteria such as Validity, Plausibility, Closeness and Diversity. We show that our algorithm can produce valuable explanations; moreover, we propose a novel visualization of our explanations that can convey a richer interpretation of a detection algorithm’s internal mechanism than existing post-hoc explainability methods. Additionally, we design a sparse variant of our method to improve the interpretability of our explanation for high-dimensional time series anomalies. In this setting, our explanation is localized on only a few dimensions and can therefore be communicated more efficiently to the model’s user.

URL: https://openreview.net/forum?id=SJ5PuXqi2z

---

Title: Transframer: Arbitrary Frame Prediction with Generative Models

Abstract: We present a general-purpose framework for image modelling and vision tasks based on probabilistic frame prediction. Our approach unifies a broad range of tasks, from image segmentation, to novel view synthesis and video interpolation. We pair this framework with an architecture we term Transframer, which uses U-Net and Transformer components to condition on annotated context frames, and outputs sequences of sparse, compressed image features. Transframer is the state-of-the-art on a variety of video generation benchmarks, is competitive with the strongest models on few-shot view synthesis, and can generate coherent 30 second videos from a single image without any explicit geometric information.

A single generalist Transframer simultaneously produces promising results on 8 tasks, including semantic segmentation, image classification and optical flow prediction with no task-specific architectural components, demonstrating that multi-task computer vision can be tackled using probabilistic image models.

Our approach can in principle be applied to a wide range of applications that require learning the conditional structure of annotated image-formatted data.

URL: https://openreview.net/forum?id=OJtYpdiHNo

---

Title: Defense Against Reward Poisoning Attacks in Reinforcement Learning

Abstract: We study defense strategies against reward poisoning attacks in reinforcement learning. As a threat model, we consider cost-effective targeted attacks---these attacks minimally alter rewards to make the attacker's target policy uniquely optimal under the poisoned rewards, with the optimality gap specified by an attack parameter. Our goal is to design agents that are robust against such attacks in terms of the worst-case utility w.r.t. the true, unpoisoned, rewards while computing their policies under the poisoned rewards. We propose an optimization framework for deriving optimal defense policies, both when the attack parameter is known and unknown. For this optimization framework, we first provide characterization results for generic attack cost functions. These results show that the functional form of the attack cost function and the agent's knowledge about it are critical for establishing lower bounds on the agent's performance, as well as for the computational tractability of the defense problem. We then focus on a cost function based on $\ell_2$ norm, for which we show that the defense problem can be efficiently solved and yields defense policies whose expected returns under the true rewards are lower bounded by their expected returns under the poisoned rewards. Using simulation-based experiments, we demonstrate the effectiveness and robustness of our defense approach.

URL: https://openreview.net/forum?id=goPsLn3RVo

---

Title: A Halfspace-Mass Depth-Based Method for Adversarial Attack Detection

Abstract: Despite the widespread use of deep learning algorithms, vulnerability to adversarial attacks is still an issue limiting their use in critical applications. Detecting these attacks is thus crucial to build reliable algorithms and has received increasing attention in the last few years.
In this paper, we introduce the HalfspAce Mass dePth dEtectoR (HAMPER), a new method to detect adversarial examples by leveraging the concept of data depths, a statistical notion that provides center-outward ordering of points with respect to (w.r.t.) a probability distribution. In particular, the halfspace-mass (HM) depth exhibits attractive properties such as computational efficiency, which makes it a natural candidate for adversarial attack detection in high-dimensional spaces. Additionally, HM is non-differentiable, making it harder for attackers to directly attack HAMPER via gradient-based methods. We evaluate HAMPER in the context of supervised adversarial attack detection across four benchmark datasets.
Overall, we empirically show that HAMPER consistently outperforms SOTA methods. In particular, the gains are 13.1% (29.0%) in terms of AUROC (resp. FPR) on SVHN, 14.6% (25.7%) on CIFAR10 and 22.6% (49.0%) on CIFAR100 compared to the best performing method.

URL: https://openreview.net/forum?id=YtU0nDb5e8

---

Title: Differentially private partitioned variational inference

Abstract: Learning a privacy-preserving model from sensitive data which are distributed across multiple devices is an increasingly important problem. The problem is often formulated in the federated learning context, with the aim of learning a single global model while keeping the data distributed. Moreover, Bayesian learning is a popular approach for modelling, since it naturally supports reliable uncertainty estimates. However, Bayesian learning is generally intractable even with centralised non-private data and so approximation techniques such as variational inference are a necessity. Variational inference has recently been extended to the non-private federated learning setting via the partitioned variational inference algorithm. For privacy protection, the current gold standard is called differential privacy. Differential privacy guarantees privacy in a strong, mathematically clearly defined sense.

In this paper, we present differentially private partitioned variational inference, the first general framework for learning a variational approximation to a Bayesian posterior distribution in the federated learning setting while minimising the number of communication rounds and providing differential privacy guarantees for data subjects.

We propose three alternative implementations in the general framework, one based on perturbing local optimisation runs done by individual parties, and two based on perturbing updates to the global model (one using a version of federated averaging, the second one adding virtual parties to the protocol), and compare their properties both theoretically and empirically. We show that perturbing the local optimisation works well with simple and complex models as long as each party has enough local data. However, the privacy is always guaranteed independently by each party. In contrast, perturbing the global updates works best with relatively simple models. Given access to suitable secure primitives, such as secure aggregation or secure shuffling, the performance can be improved by all parties guaranteeing privacy jointly.

URL: https://openreview.net/forum?id=55BcghgicI

---

Title: Beyond Predictions in Neural ODEs: Identification and Interventions

Abstract: Spurred by tremendous success in pattern matching and prediction tasks, researchers increasingly resort to machine learning to aid original scientific discovery. Given large amounts of observational data about a system, can we uncover the rules that govern its evolution? Solving this task holds the great promise of fully understanding the causal interactions and being able to make reliable predictions about the system’s behavior under interventions. We take a step towards such system identification for time-series data generated from systems of ordinary differential equations (ODEs) using flexible neural ODEs. Neural ODEs have proven successful in learning dynamical systems in terms of recovering observed trajectories. However, their efficacy in learning ground truth dynamics and making predictions under unseen interventions is still underexplored. We develop a simple regularization scheme for neural ODEs that helps in recovering the dynamics and causal structure from time-series data. Our results on a variety of (non)-linear first and second order systems as well as real data validate our method. We conclude by showing that we can also make accurate predictions under interventions on variables or the system itself.

URL: https://openreview.net/forum?id=HyPvqFnI5J

---

Title: Temperature check: theory and practice for training models with softmax-cross-entropy losses

Abstract: The softmax function combined with a cross-entropy loss is a principled approach to modeling probability distributions that has become ubiquitous in deep learning. The softmax function is defined by a lone hyperparameter, the temperature, that is commonly set to one or regarded as a way to tune model confidence after training; however, less is known about how the temperature impacts training dynamics or generalization performance. In this work we develop a theory of early learning for models trained with softmax-cross-entropy loss and show that the learning dynamics depend crucially on the inverse-temperature $\beta$ as well as the magnitude of the logits at initialization, $||\beta\textbf{z}||_{2}$. We follow up these analytic results with a large-scale empirical study of a variety of model architectures trained on CIFAR10, ImageNet, and IMDB sentiment analysis. We find that generalization performance depends strongly on the temperature, but only weakly on the initial logit magnitude. We provide evidence that the dependence of generalization on $\beta$ is not due to changes in model confidence, but is a dynamical phenomenon. It follows that the addition of $\beta$ as a tunable hyperparameter is key to maximizing model performance. Although we find the optimal $\beta$ to be sensitive to the architecture, our results suggest that tuning $\beta$ over the range $10^{-2}$ to $10^1$ improves performance over all architectures studied. We find that smaller $\beta$ may lead to better peak performance at the cost of learning stability.
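To make the role of the inverse temperature concrete, here is a minimal NumPy sketch of a softmax-cross-entropy loss in which the logits are scaled by $\beta$ before normalization. The function names and the example logits are illustrative assumptions; the paper's training setup is not reproduced.

```python
import numpy as np

def softmax_with_inverse_temperature(logits, beta=1.0):
    """Return softmax(beta * logits) along the last axis."""
    scaled = beta * np.asarray(logits, dtype=float)
    # Subtracting the max is for numerical stability and leaves the result unchanged.
    scaled = scaled - scaled.max(axis=-1, keepdims=True)
    exp = np.exp(scaled)
    return exp / exp.sum(axis=-1, keepdims=True)

def cross_entropy(logits, label, beta=1.0):
    """Negative log-probability of the true label under softmax(beta * logits)."""
    probs = softmax_with_inverse_temperature(logits, beta)
    return -np.log(probs[label])

# Hypothetical logits for a three-class example.
z = np.array([2.0, 1.0, -1.0])
for beta in (0.01, 1.0, 10.0):
    p = softmax_with_inverse_temperature(z, beta)
    print(beta, p.round(3), cross_entropy(z, 0, beta).round(3))
```

Small $\beta$ flattens the output distribution towards uniform and large $\beta$ sharpens it towards the largest logit, which is why $\beta$ and the initial logit scale enter the analysis together through $||\beta\textbf{z}||_{2}$.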

URL: https://openreview.net/forum?id=LBA2Jj5Gqn

---
