Weekly TMLR digest for Sep 11, 2022

TMLR

Sep 10, 2022, 8:00:07 PM
to tmlr-annou...@googlegroups.com


New certifications
==================

Survey Certification: Emergent Abilities of Large Language Models

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, William Fedus

https://openreview.net/forum?id=yzkSU5zdwD

---


Survey Certification: A Comprehensive Study of Real-Time Object Detection Networks Across Multiple Domains: A Survey

Elahe Arani, Shruthi Gowda, Ratnajit Mukherjee, Omar Magdy, Senthilkumar Sockalingam Kathiresan, Bahram Zonooz

https://openreview.net/forum?id=ywr5sWqQt4

---


Survey Certification: On the link between conscious function and general intelligence in humans and machines

Arthur Juliani, Kai Arulkumaran, Shuntaro Sasai, Ryota Kanai

https://openreview.net/forum?id=LTyqvLEv5b

---


Reproducibility Certification: Non-Deterministic Behavior of Thompson Sampling with Linear Payoffs and How to Avoid It

Doruk Kilitcioglu, Serdar Kadioglu

https://openreview.net/forum?id=sX9d3gfwtE

---


Survey Certification: Structural Learning in Artificial Neural Networks: A Neural Operator Perspective

Kaitlin Maile, Luga Hervé, Dennis George Wilson

https://openreview.net/forum?id=gzhEGhcsnN

---


Accepted papers
===============


Title: Momentum Capsule Networks

Authors: Josef Gugglberger, Antonio Rodriguez-sanchez, David Peer

Abstract: Capsule networks are a class of neural networks that aim to address some limiting factors of Convolutional Neural Networks. However, baseline capsule networks have failed to reach state-of-the-art results on more complex datasets due to their high computation and memory requirements. We tackle this problem by proposing a new network architecture, called Momentum Capsule Network (MoCapsNet). MoCapsNets are inspired by Momentum ResNets, a type of network that applies reversible residual building blocks. Reversible networks allow the activations of the forward pass to be recomputed during backpropagation, so memory requirements can be drastically reduced. In this paper, we provide a framework for applying invertible residual building blocks to capsule networks. We show that MoCapsNet beats the accuracy of baseline capsule networks on MNIST, SVHN, CIFAR-10 and CIFAR-100 while using considerably less memory. The source code is available at https://github.com/moejoe95/MoCapsNet.
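
As a rough illustration of the reversible residual blocks MoCapsNet builds on, the following minimal NumPy sketch (illustrative only, not the authors' code; the paper couples such blocks with capsule layers) shows why activations need not be stored: inputs are reconstructed exactly from outputs.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy residual functions F and G (stand-ins for capsule sub-layers).
    W_f = rng.normal(size=(4, 4))
    W_g = rng.normal(size=(4, 4))
    F = lambda x: np.tanh(x @ W_f)
    G = lambda x: np.tanh(x @ W_g)

    def forward(x1, x2):
        # Additive coupling: each half of the state is updated from the other.
        y1 = x1 + F(x2)
        y2 = x2 + G(y1)
        return y1, y2

    def inverse(y1, y2):
        # Forward activations are recomputed from the outputs during
        # backpropagation instead of being kept in memory.
        x2 = y2 - G(y1)
        x1 = y1 - F(x2)
        return x1, x2

    x1, x2 = rng.normal(size=(2, 4)), rng.normal(size=(2, 4))
    y1, y2 = forward(x1, x2)
    r1, r2 = inverse(y1, y2)
    print(np.allclose(x1, r1), np.allclose(x2, r2))  # True True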

URL: https://openreview.net/forum?id=Su290sknyQ

---

Title: ANCER: Anisotropic Certification via Sample-wise Volume Maximization

Authors: Francisco Eiras, Motasem Alfarra, Philip Torr, M. Pawan Kumar, Puneet K. Dokania, Bernard Ghanem, Adel Bibi

Abstract: Randomized smoothing has recently emerged as an effective tool that enables certification of deep neural network classifiers at scale. All prior art on randomized smoothing has focused on isotropic $\ell_p$ certification, which has the advantage of yielding certificates that can be easily compared among isotropic methods via the $\ell_p$-norm radius. However, isotropic certification limits the region that can be certified around an input to worst-case adversaries, i.e., it cannot reason about other "close", potentially large, constant-prediction safe regions. To alleviate this issue, (i) we theoretically extend the isotropic randomized smoothing $\ell_1$ and $\ell_2$ certificates to their generalized anisotropic counterparts following a simplified analysis. Moreover, (ii) we propose evaluation metrics allowing for the comparison of general certificates (a certificate is superior to another if it certifies a superset region), quantifying each certificate through the volume of the certified region. We introduce ANCER, a framework for obtaining anisotropic certificates for a given test set sample via volume maximization. We achieve this by generalizing memory-based certification of data-dependent classifiers. Our empirical results demonstrate that ANCER achieves state-of-the-art $\ell_1$ and $\ell_2$ certified accuracy on CIFAR-10 and ImageNet in the data-dependent setting, while certifying larger regions in terms of volume, highlighting the benefits of moving away from isotropic analysis.
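
The volume-based comparison can be made concrete with the standard ellipsoid volume formula; in the minimal sketch below (the per-axis radii are hypothetical, not taken from the paper), an anisotropic certificate that relaxes half the axes certifies a region of far larger volume than an isotropic ball.

    import numpy as np
    from math import lgamma, pi

    def log_volume(semi_axes):
        # log volume of the ellipsoid {x : sum((x_i / a_i)^2) <= 1}.
        n = len(semi_axes)
        log_unit_ball = (n / 2) * np.log(pi) - lgamma(n / 2 + 1)
        return log_unit_ball + np.sum(np.log(semi_axes))

    n = 32
    iso = np.full(n, 0.5)       # isotropic l2 certificate of radius 0.5
    aniso = np.full(n, 0.5)
    aniso[: n // 2] = 1.5       # hypothetical: extra slack along half the axes
    print(log_volume(iso), log_volume(aniso))  # anisotropic region is much larger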

URL: https://openreview.net/forum?id=7j0GI6tPYi

---

Title: On the Choice of Interpolation Scheme for Neural CDEs

Authors: James Morrill, Patrick Kidger, Lingyi Yang, Terry Lyons

Abstract: Neural controlled differential equations (Neural CDEs) are a continuous-time extension of recurrent neural networks (RNNs), achieving state-of-the-art (SOTA) performance at modelling functions of irregular time series. In order to interpret discrete data in continuous time, current implementations rely on non-causal interpolations of the data. This is fine when the whole time series is observed in advance, but means that Neural CDEs are not suitable for use in online prediction tasks, where predictions need to be made in real-time: a major use case for recurrent networks. Here, we show how this limitation may be rectified. First, we identify several theoretical conditions that control paths for Neural CDEs should satisfy, such as boundedness and uniqueness. Second, we use these to motivate the introduction of new schemes that address these conditions, offering in particular measurability (for online prediction), and smoothness (for speed). Third, we empirically benchmark our online Neural CDE model on three continuous monitoring tasks from the MIMIC-IV medical database: we demonstrate improved performance on all tasks against ODE benchmarks, and on two of the three tasks against SOTA non-ODE benchmarks.
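
The causality issue can be seen in a minimal sketch contrasting non-causal linear interpolation with a simple causal "hold" scheme (illustrative only; the paper's proposed schemes also address smoothness and uniqueness of the control path).

    import numpy as np

    def linear_interp(t, ts, xs):
        # Non-causal: the value between ts[i] and ts[i+1] depends on xs[i+1],
        # i.e. on a future observation, so it is unusable for online prediction.
        return np.interp(t, ts, xs)

    def hold_interp(t, ts, xs):
        # Causal: depends only on observations with timestamp <= t.
        idx = np.searchsorted(ts, t, side="right") - 1
        return xs[np.clip(idx, 0, len(xs) - 1)]

    ts = np.array([0.0, 1.0, 2.0, 3.0])
    xs = np.array([0.0, 2.0, 1.0, 3.0])
    query = np.array([0.5, 1.5, 2.5])
    print(linear_interp(query, ts, xs))  # [1.  1.5 2. ]  peeks at future points
    print(hold_interp(query, ts, xs))    # [0. 2. 1.]     causal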

URL: https://openreview.net/forum?id=caRBFhxXIG

---

Title: Conformal Prediction Intervals with Temporal Dependence

Authors: Zhen Lin, Shubhendu Trivedi, Jimeng Sun

Abstract: Cross-sectional prediction is common in many domains such as healthcare, including forecasting tasks using electronic health records, where different patients form a cross-section. We focus on the task of constructing valid prediction intervals (PIs) in time series regression with a cross-section. A prediction interval is considered valid if it covers the true response with (a pre-specified) high probability. We first distinguish between two notions of validity in such a setting: cross-sectional and longitudinal. Cross-sectional validity is concerned with validity across the cross-section of the time series data, while longitudinal validity accounts for the temporal dimension. Coverage guarantees along both these dimensions are ideally desirable; however, we show that distribution-free longitudinal validity is theoretically impossible. Despite this limitation, we propose Conformal Prediction with Temporal Dependence (CPTD), a procedure that is able to maintain strict cross-sectional validity while improving longitudinal coverage. CPTD is post-hoc and light-weight, and can easily be used in conjunction with any prediction model as long as a calibration set is available. We focus on neural networks due to their ability to model complicated data such as diagnosis codes for time series regression, and perform extensive experimental validation to verify the efficacy of our approach. We find that CPTD outperforms baselines on a variety of datasets by improving longitudinal coverage and often providing more efficient (narrower) PIs.
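
For context, the split conformal interval that such post-hoc procedures start from can be written in a few lines (a minimal sketch with synthetic residuals; the paper's contribution, adapting the procedure to temporal dependence, is not reproduced here).

    import numpy as np

    def split_conformal_radius(cal_residuals, alpha=0.1):
        # (1 - alpha) empirical quantile of calibration residuals,
        # with the usual (n + 1) finite-sample correction.
        n = len(cal_residuals)
        k = int(np.ceil((n + 1) * (1 - alpha)))
        return np.sort(cal_residuals)[min(k, n) - 1]

    rng = np.random.default_rng(0)
    y_cal = rng.normal(size=500)             # synthetic calibration targets
    yhat_cal = 0.1 * rng.normal(size=500)    # synthetic model predictions
    q = split_conformal_radius(np.abs(y_cal - yhat_cal), alpha=0.1)
    yhat_test = 0.3
    print(f"90% PI: [{yhat_test - q:.2f}, {yhat_test + q:.2f}]")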

URL: https://openreview.net/forum?id=8QoxXTDcsH

---

Title: Meta-Learning Sparse Compression Networks

Authors: Jonathan Schwarz, Yee Whye Teh

Abstract: Recent work in Deep Learning has re-imagined the representation of data as functions mapping from a coordinate space to an underlying continuous signal. When such functions are approximated by neural networks this introduces a compelling alternative to the more common multi-dimensional array representation. Recent work on such Implicit Neural Representations (INRs) has shown that - following careful architecture search - INRs can outperform established compression methods such as JPEG (e.g. Dupont et al., 2021). In this paper, we propose crucial steps towards making such ideas scalable: Firstly, we employ state-of-the-art network sparsification techniques to drastically improve compression. Secondly, we introduce the first method allowing for sparsification to be employed in the inner loop of commonly used Meta-Learning algorithms, drastically improving both compression and the computational cost of learning INRs. The generality of this formalism allows us to present results on diverse data modalities such as images, manifolds, signed distance functions, 3D shapes and scenes, several of which establish new state-of-the-art results.
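
The sparsification primitive involved can be illustrated in a few lines of NumPy (a minimal magnitude-pruning sketch; the paper's contribution, meta-learning this sparsity in the inner loop, is not shown).

    import numpy as np

    def magnitude_sparsify(w, keep_frac):
        # Zero out all but the largest-magnitude fraction of weights; the
        # surviving (index, value) pairs form the compressed code of the INR.
        k = max(1, int(keep_frac * w.size))
        thresh = np.sort(np.abs(w).ravel())[-k]
        mask = np.abs(w) >= thresh
        return w * mask, mask

    rng = np.random.default_rng(0)
    w = rng.normal(size=(64, 64))          # hypothetical INR layer weights
    w_sparse, mask = magnitude_sparsify(w, keep_frac=0.05)
    print(f"kept {mask.sum()} of {w.size} weights ({mask.mean():.1%})")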

URL: https://openreview.net/forum?id=Cct7kqbHK6

---


New submissions
===============


Title: A Stochastic Optimization Framework for Fair Risk Minimization

Abstract: Despite the success of large-scale empirical risk minimization (ERM) at achieving high accuracy across a variety of machine learning tasks, fair ERM is hindered by the incompatibility of fairness constraints with stochastic optimization. We consider the problem of fair classification with discrete sensitive attributes and potentially large models and data sets, requiring stochastic solvers. Existing in-processing fairness algorithms are either impractical in the large-scale setting, because they require large batches of data at each iteration, or they are not guaranteed to converge. In this paper, we develop the first stochastic in-processing fairness algorithm with guaranteed convergence. For the demographic parity, equalized odds, and equal opportunity notions of fairness, we provide slight variations of our algorithm, called FERMI, and prove that each of these variations converges in stochastic optimization with any batch size. Empirically, we show that FERMI is amenable to stochastic solvers with multiple (non-binary) sensitive attributes and non-binary targets, performing well even with a minibatch size as small as one. Extensive experiments show that FERMI achieves the most favorable tradeoffs between fairness violation and test accuracy across all tested setups compared with state-of-the-art baselines for demographic parity, equalized odds, and equal opportunity. These benefits are especially significant with small batch sizes and for non-binary classification with large sensitive sets, making FERMI a practical fairness algorithm for large-scale problems.
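
For intuition, here is a sketch of the kind of naive minibatch fairness penalty that motivates the paper (this is deliberately not FERMI's estimator; naive penalties of this form are among those the paper notes are not guaranteed to converge at small batch sizes).

    import numpy as np

    def dp_penalty(scores, groups):
        # Naive demographic-parity surrogate on one minibatch: variance of
        # the mean predicted score across sensitive groups. Its minibatch
        # gradient is a biased estimate of the full-batch gradient, which
        # is the core difficulty the paper addresses.
        group_means = [scores[groups == g].mean() for g in np.unique(groups)]
        return np.var(group_means)

    rng = np.random.default_rng(0)
    scores = rng.uniform(size=64)          # hypothetical model outputs in [0, 1]
    groups = rng.integers(0, 2, size=64)   # hypothetical binary sensitive attribute
    lam = 1.0
    print("penalized loss term:", lam * dp_penalty(scores, groups))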

URL: https://openreview.net/forum?id=P9Cj6RJmN2

---

Title: Sampling from energy-based models with divergence diagnostics

Abstract: Energy-based models (EBMs) allow flexible specifications of probability distributions. However, sampling from EBMs is non-trivial, usually requiring approximate techniques such as Markov chain Monte Carlo (MCMC). A major downside of MCMC sampling is that it is often impossible to compute the divergence of the sampling distribution from the target distribution: therefore, the quality of the samples cannot be guaranteed. Here, we introduce quasi-rejection sampling (QRS), a simple extension of rejection sampling that performs approximate sampling, but, crucially, does provide divergence diagnostics (in terms of f-divergences, such as KL divergence and total variation distance). We apply QRS to sampling from discrete EBMs over text for controlled generation. We show that we can sample from such EBMs with arbitrary precision in exchange for sampling efficiency and quantify the trade-off between the two by means of the aforementioned diagnostics.
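
On a discrete toy example the trade-off the abstract describes can be computed exactly (a minimal sketch; the target, proposal, and beta values below are made up).

    import numpy as np

    P = np.array([1.0, 4.0, 2.0, 0.5, 0.5])   # unnormalized target over {0,...,4}
    p = P / P.sum()
    q = np.full(5, 0.2)                        # uniform proposal

    def qrs_distribution(beta):
        # Draw x ~ q, accept with prob min(1, P(x) / (beta * q(x))).
        # Accepted samples follow p_beta; a larger beta brings p_beta closer
        # to p but lowers the acceptance rate (sampling efficiency).
        unnorm = np.minimum(beta * q, P)
        acceptance_rate = unnorm.sum() / beta
        return unnorm / unnorm.sum(), acceptance_rate

    for beta in [1.0, 5.0, 20.0]:
        p_beta, ar = qrs_distribution(beta)
        tv = 0.5 * np.abs(p_beta - p).sum()    # divergence diagnostic (TV distance)
        print(f"beta={beta:5.1f}  TV={tv:.3f}  acceptance={ar:.3f}")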


URL: https://openreview.net/forum?id=VW4IrC0n0M

---

Title: Adaptively Phased Algorithm for Linear Contextual Bandits

Abstract: We propose a novel algorithm for the linear contextual bandit problem when the set of arms is finite. Recently, the minimax expected regret for this problem was shown to be $\Omega(\sqrt{dT\log T\log K})$ with $T$ rounds, $d$-dimensional contexts, and $K \leq 2^{d/2}$ arms per round. Previous works on phased algorithms attain this lower bound in the worst case up to logarithmic factors \citep{Auer, Chu11} or iterated logarithmic factors \citep{Li19}, but require a priori knowledge of the time horizon $T$ to construct the phases, which limits their use in practice. In this paper we propose a novel phased algorithm that does not require a priori knowledge of $T$, but constructs the phases in an adaptive way. We show that the proposed algorithm guarantees a regret upper bound of order $O(d^{\alpha}\sqrt{T\log T(\log K+\log T)})$ where $\frac{1}{2}\leq \alpha\leq 1$. The proposed algorithm can be viewed as a generalization of Rarely Switching OFUL \citep{Abbasi-Yadkori} by capitalizing on a tight confidence bound for the parameter in each phase, obtained through independent rewards in the same phase.

URL: https://openreview.net/forum?id=uHKFly27T1

---

Title: PolyViT: Co-training Vision Transformers on Images, Videos and Audio

Abstract: Can we train a single transformer model capable of processing multiple modalities and datasets, whilst sharing almost all of its learnable parameters? We present PolyViT, a model trained on images, audio and video to answer this question. By co-training on different tasks of a single modality, we are able to achieve significant accuracy improvements on 5 standard video- and audio-classification datasets. Furthermore, co-training PolyViT on multiple modalities and tasks leads to a parameter-efficient model which generalizes across multiple domains. In particular, our multi-modal PolyViT trained on 9 datasets across 3 modalities uses 8.3 times fewer parameters and outperforms a state-of-the-art single-task baseline on 2 of these datasets, whilst achieving competitive performance on the others. Finally, this simple and practical approach necessitates less hyperparameter tuning as the per-task hyperparameters can be readily reused.
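
One plausible co-training schedule is to sample a task per optimizer step with probability proportional to dataset size (a minimal sketch; the dataset names and sizes are hypothetical, and this is only one of several schedules one might use).

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical datasets across three modalities: name -> number of examples.
    datasets = {"img_a": 1_300_000, "img_b": 50_000, "vid_a": 240_000,
                "aud_a": 20_000, "aud_b": 50_000}
    names = list(datasets)
    sizes = np.array([datasets[n] for n in names], dtype=float)
    probs = sizes / sizes.sum()

    def cotraining_schedule(num_steps):
        # One task (hence one loss and one head) per step; the shared
        # transformer trunk sees minibatches from all tasks interleaved.
        return rng.choice(names, size=num_steps, p=probs)

    print(cotraining_schedule(10))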

URL: https://openreview.net/forum?id=zKnqZeUCLO

---

Title: Extracting Local Reasoning Chains of Deep Neural Networks

Abstract: We study how to explain the main steps of inference that a pre-trained deep neural net (DNN) relies on to produce predictions for a (sub)task and its data. This problem is related to network pruning and to interpretable machine learning, with the following key differences: (1) fine-tuning of any neurons/filters is forbidden; (2) we target a very high pruning rate, e.g., ≥ 95%; (3) the interpretation covers the whole inference process on a few data points from a task rather than individual neurons/filters or a single sample. In this paper, we introduce NeuroChains, which extracts local inference chains by optimizing differentiable sparse scores for the filters and layers; the scores reflect their importance in preserving the outputs on a few data points drawn from a given (sub)task. NeuroChains can thereby extract an extremely small sub-network, composed of critical filters copied exactly from the original pre-trained DNN, by removing the filters/layers with small scores. For samples from the same class, we can then visualize the inference pathway in the pre-trained DNN by applying existing interpretation techniques to the retained filters and layers. The extracted architecture reveals how the inference process stitches and integrates information layer by layer and filter by filter. We provide detailed and insightful case studies together with several quantitative analyses over thousands of trials to demonstrate the quality, sparsity, fidelity, and accuracy of the interpretation. In extensive empirical studies, NeuroChains significantly enriches the interpretation and makes the inner mechanism of DNNs more transparent.
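
The score-then-prune step described above can be sketched in a few lines (illustrative only; the optimization of the scores against output preservation is not shown, and all names are hypothetical).

    import numpy as np

    def extract_chain(filters, scores, keep=0.05):
        # Keep only the top-scoring fraction of filters, copied unchanged
        # from the pre-trained network, to form the local inference chain.
        k = max(1, int(keep * len(scores)))
        kept = np.argsort(scores)[::-1][:k]
        return {int(i): filters[i] for i in kept}

    rng = np.random.default_rng(0)
    filters = [rng.normal(size=(3, 3)) for _ in range(256)]  # stand-in conv filters
    scores = rng.uniform(size=256)                           # stand-in learned scores
    chain = extract_chain(filters, scores, keep=0.05)
    print(f"extracted {len(chain)} of {len(filters)} filters")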

URL: https://openreview.net/forum?id=RP6G787uD8

---

Title: Shapley Oracle Pruning for Convolutional Neural Networks

Abstract: Recent hardware and algorithmic developments have scaled convolutional neural networks to considerable sizes. The performance of a network then relies on the interplay of an even larger pool of possibly correlated and redundant parameters, huddled in convolutional channels or residual blocks. We therefore propose a game-theoretic approach based on the Shapley value, which, accounting for neuron synergies, computes the average contribution of a neuron. A significant feature of the method is that it incorporates oracle pruning, the ideal configuration of a compressed network, to build a unique ranking of nodes that satisfies a range of normative criteria. The ranking makes it possible to select the top parameters in the network and remove the trailing ones, producing a smaller and more interpretable model. As applying the Shapley value to numerous neurons is computationally challenging, we introduce three tractable approximations to handle large models and provide pruning in a reasonable time. The experiments show that the proposed normative ranking and its approximations yield practical results, obtaining state-of-the-art network compression. The code is available at https://anonymous.4open.science/r/shapley_oracle_pruning1/.
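
The permutation-sampling approximation commonly used for Shapley values can be sketched as follows (a toy additive value function stands in for "accuracy of the network with this set of neurons kept"; these are not the paper's three approximations).

    import numpy as np

    rng = np.random.default_rng(0)

    def shapley_estimates(num_players, value_fn, num_perms=200):
        # A player's Shapley value is its average marginal contribution
        # over random orders in which players join the coalition.
        phi = np.zeros(num_players)
        for _ in range(num_perms):
            order = rng.permutation(num_players)
            coalition, v_prev = set(), value_fn(set())
            for i in order:
                coalition.add(int(i))
                v_new = value_fn(coalition)
                phi[i] += v_new - v_prev
                v_prev = v_new
        return phi / num_perms

    weights = np.array([0.5, 0.3, 0.15, 0.05])  # hypothetical per-neuron utilities
    value = lambda s: float(sum(weights[j] for j in s))
    print(shapley_estimates(4, value))  # ~= weights: prune the trailing neurons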

URL: https://openreview.net/forum?id=kEm8Es47dw

---

Title: Meta Automatic Curriculum Learning for Classrooms of Black-Box Students

Abstract: A major challenge in the Deep RL (DRL) community is to train agents able to generalize their control policy over situations never seen in training. Training on diverse tasks has been identified as a key ingredient for good generalization, which has pushed researchers towards using rich procedural task generation systems controlled through complex continuous parameter spaces. In such complex task spaces, it is essential to rely on some form of Automatic Curriculum Learning (ACL) to adapt the task sampling distribution to a given learning agent, instead of randomly sampling tasks, as many could end up being either trivial or unfeasible. Since it is hard to get prior knowledge on such task spaces, many ACL algorithms explore the task space to detect progress niches over time. This costly tabula rasa search process needs to be performed for each new learning agent, although agents might have similarities in their capability profiles. To address this limitation, we introduce the concept of Meta-ACL, and formalize it in the context of black-box RL learners, i.e. algorithms seeking to generalize curriculum generation to an (unknown) distribution of learners. In this work, we present AGAIN, a first instantiation of Meta-ACL, and showcase its benefits for curriculum generation over classical ACL in multiple simulated environments, including procedurally generated parkour environments with learners of varying morphologies. Videos and code are available at https://sites.google.com/view/meta-acl.
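
The learning-progress heuristic that classical ACL methods use to find such progress niches can be sketched as follows (illustrative only; AGAIN's meta-level reuse of curricula across learners is not shown, and the return histories are made up).

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_task(return_histories, eps=0.1):
        # Favor tasks with high absolute learning progress (recent returns
        # minus older returns), with eps-uniform mixing so that no task,
        # however stagnant, is starved of exploration.
        lp = np.array([abs(np.mean(r[-5:]) - np.mean(r[:5])) if len(r) >= 10 else 1.0
                       for r in return_histories])
        probs = eps / len(lp) + (1 - eps) * lp / lp.sum()
        return rng.choice(len(lp), p=probs)

    histories = [list(rng.normal(0.2, 0.05, 12)),   # stagnant task
                 list(np.linspace(0.1, 0.9, 12)),   # fast-progress task
                 list(rng.normal(0.0, 0.01, 12))]   # so-far unfeasible task
    print(sample_task(histories))  # most often selects the fast-progress task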

URL: https://openreview.net/forum?id=9kVicGdoh3

---

Title: A Crisis In Simulation-Based Inference? Beware, Your Posterior Approximations Can Be Unfaithful

Abstract: We present extensive empirical evidence showing that current Bayesian simulation-based inference algorithms can produce computationally unfaithful posterior approximations. Our results show that all benchmarked algorithms -- (S)NPE, (S)NRE, SNL and variants of ABC -- can yield overconfident posterior approximations, which makes them unreliable for scientific use cases and falsificationist inquiry. Failing to address this issue may reduce the range of applicability of simulation-based inference. For this reason, we argue that research efforts should be made towards theoretical and methodological developments of conservative approximate inference algorithms and present research directions towards this objective. In this regard, we show empirical evidence that ensembling posterior surrogates provides more reliable approximations and mitigates the issue.
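
The failure mode can be reproduced on a conjugate Gaussian toy problem where the exact posterior is known (a minimal sketch, not the paper's benchmarks): a surrogate that underestimates posterior variance covers the true parameter far less often than its credible level advertises.

    import numpy as np

    rng = np.random.default_rng(0)

    # Simulator: theta ~ N(0, 1), x | theta ~ N(theta, 1).
    # Exact posterior: N(x / 2, 1/2).
    n = 20_000
    theta = rng.normal(size=n)
    x = theta + rng.normal(size=n)

    def coverage(post_mean, post_std, z=1.96):
        # Fraction of simulations whose 95% credible interval contains theta.
        lo, hi = post_mean - z * post_std, post_mean + z * post_std
        return np.mean((theta >= lo) & (theta <= hi))

    exact_std = np.sqrt(0.5)
    print("exact posterior  :", coverage(x / 2, exact_std))        # ~0.95
    print("overconfident fit:", coverage(x / 2, 0.5 * exact_std))  # ~0.67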

URL: https://openreview.net/forum?id=LHAbHkt6Aq

---

Title: On Pseudo-Labeling for Class-Mismatch Semi-Supervised Learning

Abstract: When there are unlabeled Out-Of-Distribution (OOD) data from other classes, Semi-Supervised Learning (SSL) methods suffer from severe performance degradation and can even perform worse than training on labeled data alone. In this paper, we empirically analyze Pseudo-Labeling (PL) in class-mismatched SSL. PL is a simple and representative SSL method that transforms SSL problems into supervised learning by creating pseudo-labels for unlabeled data according to the model's predictions. We aim to answer two main questions: (1) How do OOD data influence PL? (2) What is the proper usage of OOD data with PL? First, we show that the major problem of PL is imbalanced pseudo-labels on OOD data. Second, we find that OOD data can help classify In-Distribution (ID) data given their OOD ground truth labels. Based on these findings, we propose to improve PL in class-mismatched SSL with two components: Re-balanced Pseudo-Labeling (RPL) and Semantic Exploration Clustering (SEC). RPL re-balances pseudo-labels of high-confidence data, which simultaneously filters out OOD data and addresses the imbalance problem. SEC uses balanced clustering on low-confidence data to create pseudo-labels on extra classes, simulating the process of training with ground truth. Experiments show that our method achieves steady improvement over the supervised baseline and state-of-the-art performance under all class mismatch ratios on different benchmarks.
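
The re-balancing idea behind RPL can be sketched in a few lines (illustrative, not the authors' code): keep an equal number of top-confidence pseudo-labels per class instead of applying one global confidence threshold, which is what lets OOD data pile into a few classes.

    import numpy as np

    def rebalance_pseudo_labels(probs, per_class):
        # Select the `per_class` most confident unlabeled samples for each
        # predicted class, yielding balanced pseudo-labels by construction.
        preds, conf = probs.argmax(1), probs.max(1)
        keep = []
        for c in range(probs.shape[1]):
            idx = np.where(preds == c)[0]
            keep.extend(idx[np.argsort(conf[idx])[::-1][:per_class]])
        return np.array(sorted(keep))

    rng = np.random.default_rng(0)
    logits = rng.normal(size=(100, 4))    # hypothetical predictions on unlabeled data
    probs = np.exp(logits) / np.exp(logits).sum(1, keepdims=True)
    selected = rebalance_pseudo_labels(probs, per_class=10)
    print(len(selected), "pseudo-labeled samples, balanced over 4 classes")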

URL: https://openreview.net/forum?id=tLG26QxoD8

---

Title: Collaborative Algorithms for Online Personalized Mean Estimation

Abstract: We consider an online estimation problem involving a set of agents. Each agent has access to a (personal) process that generates samples from a real-valued distribution and seeks to estimate its mean. We study the case where some of the distributions have the same mean, and where agents are allowed to actively query information from other agents. The goal is to design an algorithm that enables each agent to improve its mean estimate thanks to communication with other agents. The means, as well as the number of distributions with the same mean, are unknown, which makes the task nontrivial. We introduce a novel collaborative strategy to solve this online personalized mean estimation problem. We analyze its time complexity and introduce variants that enjoy good performance in numerical experiments. We also extend our approach to the setting where clusters of agents with similar means seek to estimate the mean of their cluster.
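
The test-then-pool intuition can be sketched with Hoeffding confidence intervals (a minimal illustration, not the paper's algorithm or analysis): an agent pools another agent's samples only when the data cannot refute that their means are equal.

    import numpy as np

    rng = np.random.default_rng(0)

    def hoeffding_radius(n, delta=0.05):
        # Confidence radius for the mean of n samples bounded in [0, 1].
        return np.sqrt(np.log(2 / delta) / (2 * n))

    true_means = [0.4, 0.4, 0.7]   # hypothetical: agents 0 and 1 share a mean
    samples = [rng.uniform(size=400) < m for m in true_means]  # Bernoulli data
    means = [s.mean() for s in samples]
    rad = hoeffding_radius(400)

    for i in range(3):
        # Pool with every agent whose confidence interval overlaps agent i's.
        pool = [j for j in range(3) if abs(means[i] - means[j]) <= 2 * rad]
        pooled = np.concatenate([samples[j] for j in pool]).mean()
        print(f"agent {i}: own={means[i]:.3f}  pooled over {pool} -> {pooled:.3f}")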

URL: https://openreview.net/forum?id=VipljNfZSZ

---

Title: Unsupervised Discovery and Composition of Object Light Fields

Abstract: Neural scene representations, both continuous and discrete, have recently emerged as a powerful new paradigm for 3D scene understanding. Recent efforts have tackled unsupervised discovery of object-centric neural scene representations. However, the high cost of ray-marching, exacerbated by the fact that each object representation has to be ray-marched separately, leads to insufficiently sampled radiance fields and thus noisy renderings, poor framerates, and high memory and time complexity during training and rendering. Here, we propose to represent objects in an object-centric, compositional scene representation as light fields. We propose a novel light field compositor module that enables reconstructing the global light field from a set of object-centric light fields. Dubbed Compositional Object Light Fields (COLF), our method enables unsupervised learning of object-centric neural scene representations, state-of-the-art reconstruction and novel view synthesis performance on standard datasets, and rendering and training speeds orders of magnitude faster than those of existing 3D approaches.


URL: https://openreview.net/forum?id=fDgdqEuKX0

---

Title: Dropped Scheduled Task: Mitigating Negative Transfer in Multi-task Learning using Dynamic Task Dropping

Abstract: In Multi-Task Learning (MTL), K distinct tasks are jointly optimized. Given the varying nature and complexity of the tasks, a few tasks may dominate learning, while the performance on other tasks is compromised by negative transfer from the dominant tasks. We propose the Dropped-Scheduled Task (DST) algorithm, which probabilistically “drops” specific tasks during joint optimization while scheduling the others to reduce negative transfer. For each task, a scheduling probability is decided based on four metrics: (i) task depth, (ii) number of ground-truth samples per task, (iii) amount of training completed, and (iv) task stagnancy. Based on the scheduling probability, specific tasks get joint computation cycles while others are “dropped”. To demonstrate the effectiveness of the proposed DST algorithm, we perform multi-task learning on three applications and two architectures. Across unilateral (single input) and bilateral (multiple input) multi-task networks, the chosen applications are (a) face (AFLW), (b) fingerprint (IIITD MOLF, MUST, and NIST SD27), and (c) character recognition (Omniglot). Experimental results show that the proposed DST algorithm achieves the minimum negative transfer and the lowest overall errors across tasks compared with different state-of-the-art algorithms.
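
Metric-driven task dropping can be sketched as follows (the four metrics are those named in the abstract, but the scoring and weighting below are made up for illustration, not the paper's).

    import numpy as np

    rng = np.random.default_rng(0)

    def schedule_tasks(depth, n_samples, progress, stagnancy):
        # Higher scheduling probability for deeper tasks, tasks with fewer
        # ground-truth samples, tasks with less training completed, and
        # stagnant tasks; the rest are probabilistically "dropped".
        score = (depth / depth.max()
                 + 1.0 - n_samples / n_samples.max()
                 + 1.0 - progress
                 + stagnancy)
        p_keep = score / score.max()
        kept = rng.uniform(size=len(score)) < p_keep
        return np.where(kept)[0]

    depth = np.array([2.0, 2.0, 4.0, 6.0, 6.0])
    n_samples = np.array([1e5, 5e4, 2e4, 2e4, 5e3])
    progress = np.array([0.9, 0.8, 0.5, 0.3, 0.2])   # fraction of training done
    stagnancy = np.array([0.0, 0.1, 0.0, 0.4, 0.6])
    print("tasks scheduled this step:",
          schedule_tasks(depth, n_samples, progress, stagnancy))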

URL: https://openreview.net/forum?id=myjAVQrRxS

---
