Weekly TMLR digest for Dec 11, 2022


TMLR

Dec 10, 2022, 7:00:09 PM
to tmlr-annou...@googlegroups.com


New certifications
==================

Featured Certification: Queried Unlabeled Data Improves and Robustifies Class-Incremental Learning

Tianlong Chen, Sijia Liu, Shiyu Chang, Lisa Amini, Zhangyang Wang

https://openreview.net/forum?id=oLvlPJheCD

---


Accepted papers
===============


Title: Distribution Embedding Networks for Generalization from a Diverse Set of Classification Tasks

Authors: Lang Liu, Mahdi Milani Fard, Sen Zhao

Abstract: We propose Distribution Embedding Networks (DEN) for classification with small data. In the same spirit as meta-learning, DEN learns from a diverse set of training tasks with the goal of generalizing to unseen target tasks. Unlike existing approaches, which require the inputs of training and target tasks to have the same dimension with possibly similar distributions, DEN allows training and target tasks to live in heterogeneous input spaces. This is especially useful for tabular-data tasks where labeled data from related tasks are scarce. DEN uses a three-block architecture: a covariate transformation block followed by a distribution embedding block and then a classification block. We provide theoretical insights to show that this architecture allows the embedding and classification blocks to be fixed after pre-training on a diverse set of tasks; only the covariate transformation block, with relatively few parameters, needs to be fine-tuned for each new task. To facilitate training, we also propose an approach to synthesize binary classification tasks, and demonstrate in numerical studies that DEN outperforms existing methods on a number of synthetic and real tasks.
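
A minimal PyTorch sketch of the three-block layout described above; the layer sizes, activations, and names are illustrative assumptions, not the authors' code (in particular, the paper's embedding block operates on distributions of transformed covariates, which a plain MLP only approximates):

    import torch
    import torch.nn as nn

    class DEN(nn.Module):
        def __init__(self, in_dim, hidden=32, embed=64, n_classes=2):
            super().__init__()
            # Covariate transformation block: the only part fine-tuned per task.
            self.transform = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
            # Distribution embedding and classification blocks: fixed after
            # pre-training on a diverse set of tasks.
            self.embed = nn.Sequential(nn.Linear(hidden, embed), nn.ReLU())
            self.classify = nn.Linear(embed, n_classes)

        def forward(self, x):
            return self.classify(self.embed(self.transform(x)))

    model = DEN(in_dim=10)
    for p in model.embed.parameters():      # freeze the pre-trained blocks;
        p.requires_grad = False             # only `transform` is tuned
    for p in model.classify.parameters():   # on each new task
        p.requires_grad = False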

URL: https://openreview.net/forum?id=F2rG2CXsgO

---

Title: COIN++: Neural Compression Across Modalities

Authors: Emilien Dupont, Hrushikesh Loya, Milad Alizadeh, Adam Golinski, Yee Whye Teh, Arnaud Doucet

Abstract: Neural compression algorithms are typically based on autoencoders that require specialized encoder and decoder architectures for different data modalities. In this paper, we propose COIN++, a neural compression framework that seamlessly handles a wide range of data modalities. Our approach is based on converting data to implicit neural representations, i.e. neural functions that map coordinates (such as pixel locations) to features (such as RGB values). Then, instead of storing the weights of the implicit neural representation directly, we store modulations applied to a meta-learned base network as a compressed code for the data. We further quantize and entropy code these modulations, leading to large compression gains while reducing encoding time by two orders of magnitude compared to baselines. We empirically demonstrate the feasibility of our method by compressing various data modalities, from images and audio to medical and climate data.
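
A hedged sketch of the modulation idea: a shared, meta-learned base network maps coordinates to features, and only a small per-datum modulation vector (here an additive shift on the hidden layer, with a sine activation) is stored as the compressed code. The sizes and the shift-only choice are assumptions:

    import torch
    import torch.nn as nn

    class ModulatedINR(nn.Module):
        def __init__(self, coord_dim=2, hidden=64, out_dim=3):
            super().__init__()
            self.fc1 = nn.Linear(coord_dim, hidden)   # shared, meta-learned
            self.fc2 = nn.Linear(hidden, out_dim)     # shared, meta-learned

        def forward(self, coords, modulation):
            # `modulation` is the only per-datum code that needs storing.
            h = torch.sin(self.fc1(coords) + modulation)
            return self.fc2(h)

    net = ModulatedINR()
    coords = torch.rand(1024, 2)                 # pixel locations in [0, 1]^2
    code = torch.zeros(64, requires_grad=True)   # fitted per image, then quantized
    rgb = net(coords, code)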

URL: https://openreview.net/forum?id=NXB0rEM2Tq

---

Title: Systematically and efficiently improving $k$-means initialization by pairwise-nearest-neighbor smoothing

Authors: Carlo Baldassi

Abstract: We present a meta-method for initializing (seeding) the $k$-means clustering algorithm called PNN-smoothing. It consists in splitting a given dataset into $J$ random subsets, clustering each of them individually, and merging the resulting clusterings with the pairwise-nearest-neighbor (PNN) method. It is a meta-method in the sense that when clustering the individual subsets any seeding algorithm can be used. If the computational complexity of that seeding algorithm is linear in the size of the data $N$ and the number of clusters $k$, PNN-smoothing is also almost linear with an appropriate choice of $J$, and quite competitive in practice. We show empirically, using several existing seeding methods and testing on several synthetic and real datasets, that this procedure results in systematically better costs. In particular, our method of enhancing $k$-means++ seeding proves superior in both effectiveness and speed compared to the popular ``greedy'' $k$-means++ variant. Our implementation is publicly available at https://github.com/carlobaldassi/KMeansPNNSmoothing.jl.
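
An illustrative sketch of the recipe using scikit-learn building blocks; Ward agglomeration stands in for the exact pairwise-nearest-neighbor merge, so this is an assumption rather than the paper's algorithm:

    import numpy as np
    from sklearn.cluster import KMeans, AgglomerativeClustering

    def pnn_smoothing_seed(X, k, J=8, seed=0):
        rs = np.random.RandomState(seed)
        idx = rs.permutation(len(X))
        centroids = []
        for sub in np.array_split(idx, J):
            # Any seeding can be used here; k-means++ is sklearn's default.
            km = KMeans(n_clusters=k, n_init=1, random_state=seed).fit(X[sub])
            centroids.append(km.cluster_centers_)
        C = np.vstack(centroids)                      # J * k candidate centers
        merge = AgglomerativeClustering(n_clusters=k, linkage="ward").fit(C)
        return np.array([C[merge.labels_ == j].mean(0) for j in range(k)])

    X = np.random.randn(2000, 5)
    seeds = pnn_smoothing_seed(X, k=10)
    final = KMeans(n_clusters=10, init=seeds, n_init=1).fit(X)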

URL: https://openreview.net/forum?id=FTtFAg3pek

---

Title: GhostSR: Learning Ghost Features for Efficient Image Super-Resolution

Authors: Ying Nie, Kai Han, Zhenhua Liu, Chuanjian Liu, Yunhe Wang

Abstract: Modern single image super-resolution (SISR) systems based on convolutional neural networks (CNNs) have achieved impressive performance but require huge computational costs. The problem of feature redundancy has been well studied in visual recognition tasks, but rarely discussed in SISR. Based on the observation that many features in SISR models are also similar to each other, we propose to use the shift operation for generating the redundant features (i.e. ghost features). Compared with depth-wise convolution, which is time-consuming on GPU-like devices, the shift operation brings a real inference acceleration for CNNs on common hardware. We analyze the benefits of the shift operation in SISR and make the shift orientation learnable based on the Gumbel-Softmax trick. Besides, a clustering procedure is explored based on pre-trained models to identify the intrinsic filters for generating corresponding intrinsic features. The ghost features are generated by moving these intrinsic features along a certain orientation. Finally, the complete output features are constructed by concatenating the intrinsic and ghost features together. Extensive experiments on several benchmark models and datasets demonstrate that both non-compact and lightweight SISR CNN models embedded with the proposed method can achieve performance comparable to the baseline models with a large reduction of parameters, FLOPs and GPU inference latency. For example, we reduce the parameters by 46%, FLOPs by 46%, and GPU inference latency by 42% for the $\times$2 EDSR model with almost lossless performance. Code will be available at https://gitee.com/mindspore/models/tree/master/research/cv/GhostSR.
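
A toy illustration of generating ghost features by spatially shifting intrinsic feature maps, as described above; the half/half channel split and the fixed unit shift are assumptions (the paper learns shift orientations via Gumbel-Softmax):

    import torch

    def ghost_features(intrinsic, shift=(0, 1)):
        # Shift each intrinsic feature map spatially to create its ghost copy.
        ghost = torch.roll(intrinsic, shifts=shift, dims=(2, 3))
        return torch.cat([intrinsic, ghost], dim=1)  # complete output features

    x = torch.randn(1, 32, 48, 48)   # intrinsic features (N, C, H, W)
    out = ghost_features(x)          # (1, 64, 48, 48)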

URL: https://openreview.net/forum?id=tbd9f3HwPy

---

Title: DiffuseVAE: Efficient, Controllable and High-Fidelity Generation from Low-Dimensional Latents

Authors: Kushagra Pandey, Avideep Mukherjee, Piyush Rai, Abhishek Kumar

Abstract: Diffusion probabilistic models have been shown to generate state-of-the-art results on several competitive image synthesis benchmarks but lack a low-dimensional, interpretable latent space, and are slow at generation. On the other hand, standard Variational Autoencoders (VAEs) typically have access to a low-dimensional latent space but exhibit poor sample quality. We present DiffuseVAE, a novel generative framework that integrates a VAE within a diffusion model framework, and leverages this to design novel conditional parameterizations for diffusion models. We show that the resulting model equips diffusion models with a low-dimensional VAE-inferred latent code which can be used for downstream tasks like controllable synthesis. The proposed method also improves upon the speed vs quality tradeoff exhibited in standard unconditional DDPM/DDIM models (for instance, \textbf{FID of 16.47 vs 34.36} using a standard DDIM on the CelebA-HQ-128 benchmark using \textbf{T=10} reverse process steps) without having explicitly trained for such an objective. Furthermore, the proposed model exhibits synthesis quality comparable to state-of-the-art models on standard image synthesis benchmarks like CIFAR-10 and CelebA-64 while outperforming most existing VAE-based methods. Lastly, we show that the proposed method exhibits inherent generalization to different types of noise in the conditioning signal. For reproducibility, our source code is publicly available at \url{https://github.com/kpandey008/DiffuseVAE}.

URL: https://openreview.net/forum?id=ygoNPRiLxw

---

Title: On the Origins of the Block Structure Phenomenon in Neural Network Representations

Authors: Thao Nguyen, Maithra Raghu, Simon Kornblith

Abstract: Recent work by Nguyen et al. (2021) has uncovered a striking phenomenon in large-capacity neural networks: they contain blocks of contiguous hidden layers with highly similar representations. This block structure has two seemingly contradictory properties: on the one hand, its constituent layers exhibit highly similar dominant first principal components (PCs), but on the other hand, their representations, and their common first PC, are highly dissimilar across different random seeds. Our work seeks to reconcile these discrepant properties by investigating the origin of the block structure in relation to the data and training methods. By analyzing properties of the dominant PCs, we find that the block structure arises from dominant datapoints — a small group of examples that share similar image statistics (e.g. background color). However, the set of dominant datapoints, and the precise shared image statistic, can vary across random seeds. Thus, the block structure reflects meaningful dataset statistics, but is simultaneously unique to each model. Through studying hidden layer activations and creating synthetic datapoints, we demonstrate that these simple image statistics dominate the representational geometry of the layers inside the block structure. We explore how the phenomenon evolves through training, finding that the block structure takes shape early in training, but the underlying representations and the corresponding dominant datapoints continue to change substantially. Finally, we study the interplay between the block structure and different training mechanisms, introducing a targeted intervention to eliminate the block structure, as well as examining the effects of pre-training and Shake-Shake regularization.

URL: https://openreview.net/forum?id=9tl6zjLYVS

---


New submissions
===============


Title: Can We Solve 3D Vision Tasks Starting from A 2D Vision Transformer?

Abstract: Vision Transformers (ViTs) have proven to be effective in solving 2D image understanding tasks by training over large-scale image datasets, and meanwhile, as a somewhat separate track, in modeling the 3D visual world too, such as voxels or point clouds. However, with the growing hope that transformers can become the ``universal'' modeling tool for heterogeneous data, ViTs for 2D and 3D tasks have so far adopted vastly different architecture designs that are hardly transferable. That invites an (over-)ambitious question: can we close the gap between the 2D and 3D ViT architectures? As a piloting study, this paper demonstrates the appealing promise of understanding the 3D visual world using a standard 2D ViT architecture, with only minimal customization at the input and output levels and without redesigning the pipeline. To build a 3D ViT from its 2D sibling, we ``inflate'' the patch embedding and token sequence, accompanied by new positional encoding mechanisms designed to match the 3D data geometry. The resultant ``minimalist'' 3D ViT, named \textbf{Simple3D-Former}, performs surprisingly robustly on popular 3D tasks such as object classification, point cloud segmentation and indoor scene detection, compared to highly customized 3D-specific designs. It can hence act as a strong baseline for new 3D ViTs. Moreover, we note that pursuing a unified 2D-3D ViT design has practical relevance besides just scientific curiosity. Specifically, we demonstrate that Simple3D-Former is naturally able to exploit the wealth of pre-trained weights from large-scale realistic 2D images (e.g., ImageNet), which can be plugged in to enhance 3D task performance ``for free''.
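
A hedged sketch of "inflating" a 2D patch embedding to voxels, in the spirit of the minimal customization described above (I3D-style kernel replication; patch and embedding sizes are illustrative, and in practice the 2D weights would come from a pre-trained ViT):

    import torch
    import torch.nn as nn

    embed_2d = nn.Conv2d(3, 384, kernel_size=16, stride=16)  # 2D ViT patchifier
    depth = 16
    # Replicate the 2D kernels along a new depth axis, rescaling so that
    # activations keep a similar magnitude.
    w3d = embed_2d.weight.unsqueeze(2).repeat(1, 1, depth, 1, 1) / depth
    embed_3d = nn.Conv3d(3, 384, kernel_size=(depth, 16, 16),
                         stride=(depth, 16, 16))
    with torch.no_grad():
        embed_3d.weight.copy_(w3d)

    voxels = torch.randn(1, 3, 64, 64, 64)
    tokens = embed_3d(voxels).flatten(2).transpose(1, 2)  # (1, n_tokens, 384)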

URL: https://openreview.net/forum?id=umFYHBDCcW

---

Title: Diffusion Probabilistic Modeling for Video Generation

Abstract: Denoising diffusion probabilistic models are a promising new class of generative models that mark a milestone in high-quality image generation. This paper showcases their ability to sequentially generate video, surpassing prior methods in perceptual and probabilistic forecasting metrics. We propose an autoregressive, end-to-end optimized video diffusion model inspired by recent advances in neural video compression. The model successively generates future frames by correcting a deterministic next-frame prediction using a stochastic residual generated by an inverse diffusion process. We compare this approach against five baselines on four datasets involving natural and simulation-based videos. We find significant improvements in terms of perceptual quality for all datasets. Furthermore, by introducing a scalable version of the Continuous Ranked Probability Score (CRPS) applicable to video, we show that our model also outperforms existing approaches in their probabilistic frame forecasting ability.

URL: https://openreview.net/forum?id=Sw4aYWX21a

---

Title: A Simple Nadaraya-Watson Head can offer Explainable and Calibrated Classification

Abstract: In this paper, we empirically analyze a simple, non-learnable, and nonparametric Nadaraya-Watson (NW) prediction head that can be used with any neural network architecture. In the NW head, the prediction is a weighted average of labels from a support set. The weights are computed from distances between the query feature and support features. This is in contrast to the dominant approach of using a learnable classification head (e.g., a fully-connected layer) on the features, which can be challenging to interpret and can yield poorly calibrated predictions. Our empirical results on an array of computer vision tasks demonstrate that the NW head can yield better calibration than its parametric counterpart, while having comparable accuracy and with minimal computational overhead. To further increase inference-time efficiency, we propose a simple approach that involves a clustering step run on the training set to create a relatively small distilled support set. In addition to using the weights as a means of interpreting model predictions, we further present an easy-to-compute ``support influence function,'' which quantifies the influence of a support element on the prediction for a given query. As we demonstrate in our experiments, the influence function can allow the user to debug a trained model. We believe that the NW head is a flexible, interpretable, and highly useful building block that can be used in a range of applications.
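
A minimal sketch of a Nadaraya-Watson head as described above: class probabilities are a distance-weighted average of support-set labels. The squared Euclidean distance and the softmax temperature are illustrative assumptions:

    import torch
    import torch.nn.functional as F

    def nw_head(query_feats, support_feats, support_labels, n_classes, tau=1.0):
        # Pairwise squared distances between query and support features.
        d = torch.cdist(query_feats, support_feats) ** 2       # (Q, S)
        w = F.softmax(-d / tau, dim=1)                         # kernel weights
        onehot = F.one_hot(support_labels, n_classes).float()  # (S, C)
        return w @ onehot                                      # (Q, C) probabilities

    q = torch.randn(4, 16)
    s = torch.randn(100, 16)
    y = torch.randint(0, 10, (100,))
    probs = nw_head(q, s, y, n_classes=10)   # rows sum to 1

The weight matrix `w` also directly exposes which support elements drove each prediction, which is the interpretability handle the abstract refers to.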

URL: https://openreview.net/forum?id=iEq6lhG4O3

---

Title: Why Do Vision Transformers Have Better Adversarial Robustness than CNNs?

Abstract: Deep-learning models have performed excellently in various fields because of advances in computing power and the large-scale datasets used to train large models. However, they have an inherent risk that even a small change in the input can result in a significantly different output of the trained model. Therefore, it is crucial to evaluate the robustness of deep-learning models before we trust the models’ decisions. In this paper, we evaluate the adversarial robustness of convolutional neural networks (CNNs), vision transformers (ViTs), and CNNs + ViTs, which are typical structures commonly used in computer vision, based on four new model-sensitivity metrics that we propose. These metrics were evaluated for random noise and gradient-based adversarial perturbations. For a fair comparison, models with similar capacities were used in each model group, and the experiment was conducted separately using ImageNet-1K and ImageNet-21K as the pretraining data. The experimental results showed that ViTs were more robust than CNNs for gradient-based adversarial attacks, and our quantitative and qualitative analysis of these results brings to light the cause of the difference.

URL: https://openreview.net/forum?id=tcnAvkKwA9

---

Title: Adaptive patch foraging in deep reinforcement learning agents

Abstract: Patch foraging is one of the most heavily studied behavioral optimization challenges in biology. However, despite its importance to biological intelligence, this behavioral optimization problem is understudied in artificial intelligence research. Patch foraging is especially amenable to study given that it has a known optimal solution, which may be difficult to discover given current techniques in deep reinforcement learning. Here, we investigate deep reinforcement learning agents in an ecological patch foraging task. For the first time, we show that machine learning agents can learn to patch forage adaptively in patterns similar to biological foragers, and approach optimal patch foraging behavior when accounting for temporal discounting. Finally, we show emergent internal dynamics in these agents that resemble single-cell recordings from foraging non-human primates, which complements experimental and theoretical work on the neural mechanisms of biological foraging. This work suggests that agents interacting in complex environments with ecologically valid pressures arrive at common solutions, pointing to the emergence of foundational computations behind adaptive, intelligent behavior in both biological and artificial agents.

URL: https://openreview.net/forum?id=a0T3nOP9sB

---

Title: OADAT: Experimental and Synthetic Clinical Optoacoustic Data for Standardized Image Processing

Abstract: Optoacoustic (OA) imaging is based on excitation of biological tissues with nanosecond-duration laser pulses followed by detection of ultrasound waves generated via light-absorption-mediated thermoelastic expansion. OA imaging features a powerful combination of rich optical contrast and high resolution in deep tissues, which has enabled the exploration of a number of attractive new applications in both clinical and laboratory settings. However, no standardized datasets generated with different types of experimental set-ups and associated processing methods are available to facilitate advances in broader applications of OA in clinical settings. This complicates an objective comparison between new and established data processing methods, often leading to qualitative results and arbitrary interpretations of the data. In this paper, we provide both experimental and synthetic OA raw signals and reconstructed image-domain datasets rendered with different experimental parameters and tomographic acquisition geometries. We further provide trained neural networks to tackle three important challenges related to OA image processing, namely accurate reconstruction under limited-view tomographic conditions, removal of spatial undersampling artifacts, and anatomical segmentation for improved image reconstruction. Specifically, we define 44 experiments corresponding to the aforementioned challenges as benchmarks to be used as a reference for the development of more advanced processing methods.

URL: https://openreview.net/forum?id=BVi6MhKO0G

---

Title: Learning Energy Conserving Dynamics Efficiently with Hamiltonian Gaussian Processes

Abstract: Hamiltonian mechanics is one of the cornerstones of natural sciences. Recently there has been significant interest in learning Hamiltonian systems in a free-form way directly from trajectory data. Previous methods have tackled the problem of learning from many short, low-noise trajectories, but learning from a small number of long, noisy trajectories, whilst accounting for model uncertainty, has not been addressed. In this work, we present a Gaussian process model for Hamiltonian systems with an efficient decoupled parameterisation, and introduce an energy-conserving shooting method that allows robust inference from both short and long trajectories. We demonstrate the method's success in learning Hamiltonian systems in various data settings.

URL: https://openreview.net/forum?id=DHEZuKStzH

---

Title: Dirichlet Mechanism for Differentially Private KL Divergence Minimization

Abstract: Given an empirical distribution $f(x)$ of sensitive data $x$, we consider the task of minimizing $F(y) = D_{\text{KL}} (f(x)\Vert y)$ over a probability simplex, while protecting the privacy of $x$. We observe that, if we take the exponential mechanism and use the KL divergence as the loss function, then the resulting algorithm is the $Dirichlet\text{ }mechanism$ that outputs a single draw from a Dirichlet distribution. Motivated by this, we propose a Rényi differentially private (RDP) algorithm that employs the Dirichlet mechanism to solve the KL divergence minimization task. In addition, given $f(x)$ as above and $\hat{y}$ an output of the Dirichlet mechanism, we prove a probability tail bound on $D_{\text{KL}} (f(x)\Vert \hat{y})$, which is then used to derive a lower bound for the sample complexity of our RDP algorithm. Experiments on real-world datasets demonstrate advantages of our algorithm over Gaussian and Laplace mechanisms in supervised classification and maximum likelihood estimation.
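
A hedged sketch of the Dirichlet mechanism described above. Under the exponential mechanism with KL loss, the output density is proportional to exp(-k * KL(f || y)), which is proportional to the product over i of y_i^(k f_i), i.e. a single Dirichlet(k f + 1) draw; the concentration k (which sets the privacy level) is an assumption here:

    import numpy as np

    def dirichlet_mechanism(f, k=100.0, seed=None):
        rng = np.random.default_rng(seed)
        # One private draw from Dirichlet(k * f + 1); larger k means less
        # noise (weaker privacy), smaller k means more noise.
        return rng.dirichlet(k * np.asarray(f) + 1.0)

    f = np.array([0.5, 0.3, 0.2])   # empirical distribution of sensitive data
    y_hat = dirichlet_mechanism(f)  # private estimate on the simplex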

URL: https://openreview.net/forum?id=lmr2WwlaFc

---

Title: ATNAS: Automatic Termination for Neural Architecture Search

Abstract: Neural architecture search (NAS) is a framework for automating the design process of a neural network structure. While the recent one-shot approaches have reduced the search cost, there still exists an inherent trade-off between cost and performance. It is important to appropriately stop the search and further reduce the high cost of NAS. Meanwhile, the differentiable architecture search (DARTS), a typical one-shot approach, is known to suffer from overfitting. Heuristic early-stopping strategies have been proposed to overcome such performance degradation. In this paper, we propose a more versatile and principled early-stopping criterion on the basis of the evaluation of a gap between expectation values of generalisation errors of the previous and current search steps with respect to the architecture parameters. The stopping threshold is automatically determined at each search epoch without cost. In numerical experiments, we demonstrate the effectiveness of the proposed method. We stop the one-shot NAS algorithms and evaluate the acquired architectures on the benchmark datasets: NAS-Bench-201 and NATS-Bench. Our algorithm is shown to reduce the cost of the search process while maintaining a high performance.

URL: https://openreview.net/forum?id=egpjcrBFhs

---

Title: CDA: Contrastive-adversarial Domain Adaptation

Abstract: Recent advances in unsupervised domain adaptation (UDA) reveal that adversarial learning on deep neural networks can learn domain invariant features to reduce the shift between source and target domains. While such adversarial approaches achieve domain-level alignment, they ignore the class (label) shift. When class-conditional data distributions significantly differ between the source and target domain, it can generate ambiguous features near class boundaries that are more likely to be misclassified. In this work, we propose a two-stage model for UDA called Contrastive-adversarial Domain Adaptation (CDA). While the adversarial component facilitates domain-level alignment, two-stage contrastive learning exploits class information to achieve higher intra-class compactness across domains resulting in well-separated decision boundaries. Furthermore, the proposed contrastive framework is designed as a plug-and-play module that can be easily embedded with existing adversarial methods for domain adaptation. We conduct experiments on two widely used benchmark datasets for domain adaptation, namely, Office-31 and Digits-5, and demonstrate that CDA achieves state-of-the-art results on both datasets.

URL: https://openreview.net/forum?id=1YoumU2cQd

---

Title: Solving a Special Type of Optimal Transport Problem by a Modified Hungarian Algorithm

Abstract: Computing the empirical Wasserstein distance in the Wasserstein-distance-based independence test is an optimal transport (OT) problem with a special structure. This observation inspires us to study a special type of OT problem and propose {\it a modified Hungarian algorithm} to solve it {\it exactly}. For the OT problem involving two marginals with $m$ and $n$ atoms ($m\geq n$), respectively, the computational complexity of the proposed algorithm is $\mathcal{O}(m^2n)$. Computing the empirical Wasserstein distance in the independence test requires solving this special type of OT problem, where $m=n^2$. The associated computational complexity of the proposed algorithm is $\mathcal{O}(n^5)$, while the order of applying the classic Hungarian algorithm is $\mathcal{O}(n^6)$. In addition to the aforementioned special type of OT problem, it is shown that the modified Hungarian algorithm could be adopted to solve a wider range of OT problems. Broader applications of the proposed algorithm are discussed---solving the one-to-many assignment problem and the many-to-many assignment problem. We conduct numerical experiments to validate our theoretical results. The experiment results demonstrate that the proposed modified Hungarian algorithm compares favorably with the Hungarian algorithm and the well-known Sinkhorn algorithm.
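
The modified algorithm itself is not reproduced here; as a hedged baseline illustrating the special structure (m source atoms of mass 1/m, n target atoms of mass 1/n, with m divisible by n), the problem can be solved exactly by replicating each target column m/n times and running a classic exact assignment solver:

    import numpy as np
    from scipy.optimize import linear_sum_assignment
    from scipy.spatial.distance import cdist

    def special_ot_cost(x, y):
        m, n = len(x), len(y)
        assert m % n == 0, "assumes each target atom receives m/n source atoms"
        C = cdist(x, y)                       # (m, n) ground costs
        C_big = np.repeat(C, m // n, axis=1)  # (m, m) expanded cost matrix
        r, c = linear_sum_assignment(C_big)   # exact assignment (baseline)
        return C_big[r, c].sum() / m          # empirical OT cost

    x = np.random.randn(16, 2)   # m = n^2 atoms, as in the independence test
    y = np.random.randn(4, 2)    # n atoms
    print(special_ot_cost(x, y))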

URL: https://openreview.net/forum?id=k5m8xXTOrC

---

Title: KRADA: Known-region-aware Domain Alignment for Open World Semantic Segmentation

Abstract: In semantic segmentation, we aim to train a pixel-level classifier to assign category labels to all pixels in an image, where labeled training images and unlabeled test images are from the same distribution and share the same label set. However, in an open world, the unlabeled test images probably contain unknown categories and have different distributions from the labeled images. Hence, in this paper, we consider a new, more realistic, and more challenging problem setting where the pixel-level classifier has to be trained with labeled images and unlabeled open-world images—we name it open world semantic segmentation (OSS). In OSS, the trained classifier is expected to identify unknown-class pixels and classify known-class pixels well. To solve OSS, we first investigate which distribution unknown-class pixels obey. Then, motivated by the goodness-of-fit test, we use statistical measurements to show how well a pixel fits the distribution of an unknown class and select highly-fitted pixels to form the unknown region in each test image. Eventually, we propose an end-to-end learning framework, known-region-aware domain alignment (KRADA), to distinguish unknown classes while aligning the distributions of known classes in labeled and unlabeled open-world images. The effectiveness of KRADA has been verified on two synthetic tasks and one COVID-19 segmentation task.

URL: https://openreview.net/forum?id=5II12ypVQo

---

Title: Learning to Look by Self-Prediction

Abstract: We present a method for learning active vision skills, to move the camera to observe a robot's sensors from informative points of view, without external rewards or labels. We do this by jointly training a visual predictor network, which predicts future returns of the sensors using pixels, and a camera control agent, which we reward using the negative error of the predictor. The agent thus moves the camera to points of view that are most predictive for a chosen sensor, which we select using a conditioning input to the agent. We observe that despite this noisy learned reward function, the learned policies exhibit competence by reliably framing the sensor in a specific location in the view, an emergent location which we call a behavioral fovea. We find that replacing the conventional camera with a foveal camera further increases the policies' precision.
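
A minimal sketch of the self-supervised reward described above: the camera agent is rewarded with the negative error of a predictor that forecasts a chosen sensor's value from pixels. The network, image size, and scalar sensor are all illustrative assumptions:

    import torch
    import torch.nn as nn

    predictor = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 1))

    def camera_reward(image, sensor_value):
        # Prediction error of the jointly trained predictor becomes the
        # (noisy, learned) reward signal for the camera control agent.
        err = (predictor(image) - sensor_value) ** 2
        return -err.detach()

    r = camera_reward(torch.randn(1, 3, 32, 32), torch.tensor([[0.7]]))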

URL: https://openreview.net/forum?id=9aXKUJEKwV

---

Title: The Alignment Problem in Curriculum Learning

Abstract: In curriculum learning, teaching involves cooperative selection of sequences of data via plans to facilitate efficient and effective learning. One-off cooperative selection of data has been mathematically formalized as entropy-regularized optimal transport, and the limiting behavior of myopic sequential interactions has been analyzed, both yielding theoretical and practical guarantees. We recast sequential cooperation with curriculum planning in a reinforcement learning framework and analyze performance mathematically and by simulation. We prove that infinite-length plans are equivalent to not planning under certain assumptions on the method of planning, and isolate instances where monotonicity, and hence convergence in the limit, holds, as well as cases where it does not. We also demonstrate through simulations that argmax data selection is the same across planning horizons, and demonstrate problem-dependent sensitivity of learning to the teacher's planning horizon. Thus, we find that planning ahead yields efficiency at the cost of effectiveness. This failure of alignment is illustrated in particular with grid-world examples in which the teacher must attempt to steer the learner away from a particular location in order to reach the desired grid square. We conclude with implications and directions for efficient and effective curricula.

URL: https://openreview.net/forum?id=iPGBcYs9Tk

---

Title: Unifying physical systems’ inductive biases in neural ODE using dynamics constraints

Abstract: Conservation of energy is at the core of many physical phenomena and dynamical systems. There have been a significant number of works in the past few years aimed at predicting the trajectory of motion of dynamical systems using neural networks while adhering to the law of conservation of energy. Most of these works are inspired by classical mechanics, such as Hamiltonian and Lagrangian mechanics, as well as Neural Ordinary Differential Equations. While these works have been shown to work well in their respective domains, there is a lack of a unifying method that is more generally applicable without requiring significant changes to the neural network architectures. In this work, we aim to address this issue by providing a simple method that can be applied not just to energy-conserving systems but also to dissipative systems, by including a different inductive bias for each case in the form of a regularisation term in the loss function. The proposed method does not require changing the neural network architecture and could form the basis for validating novel ideas, therefore showing promise to accelerate research in this direction.
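
An illustrative sketch of the regularisation idea described above: penalise drift of an energy function along the predicted trajectory. The quadratic Hamiltonian, the weight lam, and the trajectory shapes are assumptions:

    import torch

    def energy(q, p):
        # Example Hamiltonian: unit-mass point in a quadratic potential.
        return 0.5 * (p ** 2).sum(-1) + 0.5 * (q ** 2).sum(-1)

    def loss_fn(pred_q, pred_p, true_q, true_p, lam=0.1):
        mse = ((pred_q - true_q) ** 2 + (pred_p - true_p) ** 2).mean()
        # Inductive-bias term: energy along the trajectory should stay constant
        # (for a dissipative system one would instead penalise energy increase).
        E = energy(pred_q, pred_p)          # (T,) energies over time
        reg = ((E - E[0]) ** 2).mean()
        return mse + lam * reg

    T, d = 50, 2
    loss = loss_fn(torch.randn(T, d), torch.randn(T, d),
                   torch.randn(T, d), torch.randn(T, d))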

URL: https://openreview.net/forum?id=ZOAb497iaY

---

Title: A Deep Top-Down Approach to Hierarchically Coherent Probabilistic Forecasting

Abstract: Probabilistic, hierarchically coherent forecasting is a key problem in many practical forecasting applications -- the goal is to obtain coherent probabilistic predictions for a large number of time series arranged in a pre-specified tree hierarchy. In this paper, we present a probabilistic top-down approach to hierarchical forecasting that uses a novel attention-based RNN model to learn the distribution of the proportions according to which each parent prediction is split among its children nodes at any point in time. These probabilistic proportions are then coupled with an independent univariate probabilistic forecasting model for the root time series. The resulting forecasts are naturally coherent and provide probabilistic predictions over all time series in the hierarchy. We experiment on several public datasets and demonstrate significant improvements of up to 27% on most datasets compared to state-of-the-art probabilistic hierarchical models. Finally, we also provide theoretical justification for the superiority of our top-down approach compared to traditional bottom-up modeling.
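
A minimal sketch of the top-down split: forecast the root series, then divide each parent value among its children with predicted proportions. The two-level tree and the plain softmax are assumptions (the paper learns the proportions with an attention-based RNN):

    import torch
    import torch.nn.functional as F

    root_forecast = torch.tensor([100.0, 120.0, 90.0])  # (horizon,)
    child_logits = torch.randn(3, 4)                    # (horizon, n_children)
    proportions = F.softmax(child_logits, dim=-1)       # rows sum to 1
    child_forecasts = root_forecast.unsqueeze(-1) * proportions
    # Coherence holds by construction: children sum back to the parent.
    assert torch.allclose(child_forecasts.sum(-1), root_forecast)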

URL: https://openreview.net/forum?id=x1jbQLuvSC

---

Title: Online Optimal Tracking of Linear Systems with Adversarial Disturbances

Abstract: This paper presents a memory-augmented control solution to the optimal reference tracking problem for linear systems subject to adversarial disturbances. We assume that the dynamics of the linear system are known and that the reference signal is generated by a linear system with unknown dynamics. Under these assumptions, finding the optimal tracking controller is formalized as an online convex optimization problem that leverages memory of past disturbance and reference values to capture their temporal effects on the performance. That is, a (disturbance, reference)-action control policy is formalized, which selects the control actions as a linear map of the past disturbance and reference values. The online convex optimization is then formulated over the parameters of the policy on its past disturbance and reference values to optimize general convex costs. It is shown that our approach outperforms robust control methods and achieves a tight regret bound of $O(\sqrt{T})$.
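
A sketch of the (disturbance, reference)-action policy described above: the control action is a linear map of the H most recent disturbance and reference values. The horizon H and all dimensions are assumptions; in practice the matrices M_w, M_r are the parameters updated by the online convex optimization:

    import numpy as np

    H, n_w, n_r, n_u = 5, 3, 3, 2
    M_w = np.random.randn(H, n_u, n_w) * 0.1   # policy params (disturbance)
    M_r = np.random.randn(H, n_u, n_r) * 0.1   # policy params (reference)

    def action(w_hist, r_hist):
        # w_hist, r_hist: last H disturbance / reference values, newest first.
        u = np.zeros(n_u)
        for i in range(H):
            u += M_w[i] @ w_hist[i] + M_r[i] @ r_hist[i]
        return u

    u_t = action(np.random.randn(H, n_w), np.random.randn(H, n_r))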

URL: https://openreview.net/forum?id=5nVJlKgmxp

---

Title: The Stack: 3 TB of permissively licensed source code

Abstract: Large Language Models (LLMs) play an ever-increasing role in the field of Artificial Intelligence (AI)--not only for natural language processing but also for code understanding and generation. To stimulate open and responsible research on LLMs for code, we introduce The Stack, a 3.1 TB dataset consisting of permissively licensed source code in 30 programming languages. We describe how we collect the full dataset, construct a permissively licensed subset, present a data governance plan, discuss limitations, and show promising results on text2code benchmarks by training 350M-parameter decoders on different Python subsets. We find that (1) near-deduplicating the data significantly boosts performance across all experiments, and (2) it is possible to match previously reported HumanEval and MBPP performance using only permissively licensed data. We make the dataset available at ANONYMIZED, provide a tool called "Am I in The Stack" for developers to search The Stack for copies of their code, and provide a process for code to be removed from the dataset by following the instructions at ANONYMIZED.
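
A toy illustration of near-deduplication on code files, which the abstract reports as the biggest performance lever. Real pipelines typically use MinHash/LSH at scale; this exact-Jaccard version with an arbitrary 0.4 threshold is a simplified assumption:

    import re

    def jaccard(a, b):
        A = set(re.findall(r"\w+", a))
        B = set(re.findall(r"\w+", b))
        return len(A & B) / max(1, len(A | B))

    files = ["def add(a, b): return a + b",
             "def add(x, y): return x + y",
             "print('hello world')"]
    kept = []
    for f in files:
        # Keep a file only if it is not too similar to anything kept so far.
        if all(jaccard(f, g) < 0.4 for g in kept):
            kept.append(f)
    print(kept)   # the near-duplicate second function is dropped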

URL: https://openreview.net/forum?id=pxpbTdUEpD

---

Title: Diffusion-based Time Series Imputation and Forecasting with Structured State Space Models

Abstract: The imputation of missing values represents a significant obstacle for many real-world data analysis pipelines. Here, we focus on time series data and put forward SSSD, an imputation model that relies on two emerging technologies, (conditional) diffusion models as state-of-the-art generative models and structured state space models as internal model architecture, which are particularly suited to capture long-term dependencies in time series data. We demonstrate that SSSD matches or even exceeds state-of-the-art probabilistic imputation and forecasting performance on a broad range of data sets and different missingness scenarios, including the challenging blackout-missing scenarios, where prior approaches failed to provide meaningful results.

URL: https://openreview.net/forum?id=hHiIbk7ApW

---

Title: Non-parametric Diffusion for Scalable Node Classification on Graphs

Abstract: Deep Learning on Graphs was recently made possible with the introduction of Graph Neural Networks (GNNs). GNNs use learnable diffusion processes to propagate information through the graph and improve performance on downstream tasks. However, learning this diffusion process can be expensive in terms of memory and computation. To address this, some methods have proposed simplified diffusion processes to make GNNs more scalable. Methods like the Simplified Graph Convolutional Network (SGCN) or the Scalable Inception Graph Network (SIGN) perform diffusion as a pre-processing step, while others like Correct and Smooth (C\&S) do it as a post-processing step. In this paper, we highlight that these kinds of diffusion are non-parametric, meaning that diffusion does not rely on learnable parameters. We identify this idea as the core of scalable GNNs and propose Graph Non-Parametric Diffusion (GNPD) as a method which combines ideas from SIGN, SGCN and C\&S, to outperform all three of them on several benchmarking datasets. GNPD alternates non-parametric diffusion with simple linear models which can ignore the graph structure. This gives GNPD a high parameter efficiency, allowing it to compete with models with two orders of magnitude more parameters. Additionally, GNPD can forgo spectral embeddings, which are the computational bottleneck of the C\&S method.
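
A sketch of non-parametric diffusion as a pre-processing step, in the SGCN/SIGN spirit referenced above: propagate features with powers of the symmetrically normalised adjacency, with no learnable parameters. The random weighted graph and the SIGN-style concatenation of scales are illustrative assumptions:

    import numpy as np
    import scipy.sparse as sp

    def diffuse(A, X, k=3):
        A_hat = A + sp.eye(A.shape[0])            # add self-loops
        d = np.asarray(A_hat.sum(1)).ravel()
        D_inv_sqrt = sp.diags(1.0 / np.sqrt(d))
        P = D_inv_sqrt @ A_hat @ D_inv_sqrt       # normalised operator
        out = [X]
        for _ in range(k):
            out.append(P @ out[-1])               # k parameter-free steps
        return np.hstack(out)                     # concatenate all scales

    A = sp.random(100, 100, density=0.05, format="csr")
    A = A + A.T                                   # symmetrise
    X = np.random.randn(100, 16)
    features = diffuse(A, X)   # feed to a simple (e.g. linear) model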

URL: https://openreview.net/forum?id=LIT8tjs6rJ

---

Title: Controlling Neural Network Smoothness for Neural Algorithmic Reasoning

Abstract: The modelling framework of neural algorithmic reasoning (Veličković & Blundell, 2021) postulates that a continuous neural network may learn to emulate the discrete reasoning steps of a symbolic algorithm. We investigate the underlying hypothesis in the simplest conceivable scenario – the addition of real numbers. Our results show that two-layer neural networks fail to learn the structure of the task, despite containing multiple solutions of the true function within their hypothesis class. Growing the network’s width leads to highly complex error regions in the input space. Moreover, we find that the network fails to generalise with increasing severity i) in the training domain, ii) outside of the training domain but within its convex hull, and iii) outside the training domain’s convex hull. This behaviour can be emulated with Gaussian process regressors that use radial basis function kernels of decreasing length scale. Classical results establish an equivalence between Gaussian processes and infinitely wide neural networks. We demonstrate a tight linkage between the scaling of a network weights’ standard deviation and its effective length scale on a sinusoidal regression problem, suggesting simple modifications to control the length scale of the function learned by a neural network and, thus, its smoothness. This has important applications for the different generalisation scenarios suggested above, but it also suggests a partial remedy to the brittleness of neural network predictions as exposed by adversarial examples. We demonstrate the gains in adversarial robustness that our modification achieves on a standard classification problem of handwritten digit recognition. In conclusion, this work shows inherent problems of neural networks even for the simplest algorithmic tasks which, however, may be partially remedied through links to Gaussian processes.
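
A small sketch of the weight-scale/length-scale link described above: shrinking the standard deviation of a wide network's first-layer weights at initialisation yields smoother (longer length scale) random functions, while growing it yields wigglier ones. The architecture, width, and scales are assumptions:

    import torch
    import torch.nn as nn

    def make_net(weight_std, width=4096):
        net = nn.Sequential(nn.Linear(1, width), nn.Tanh(), nn.Linear(width, 1))
        with torch.no_grad():
            # First-layer scale controls the effective length scale.
            net[0].weight.normal_(0.0, weight_std)
            net[0].bias.normal_(0.0, weight_std)
        return net

    x = torch.linspace(-3, 3, 200).unsqueeze(1)
    smooth = make_net(weight_std=0.5)(x)   # slowly varying function draw
    wiggly = make_net(weight_std=5.0)(x)   # rapidly varying function draw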

URL: https://openreview.net/forum?id=JnsGy9uWtI

---

Title: Imagen Video: High Definition Video Generation with Diffusion Models

Abstract: We present Imagen Video, a text-conditional video generation system based on a cascade of video diffusion models. Given a text prompt, Imagen Video generates high definition videos using a base video generation model and a sequence of interleaved spatial and temporal video super-resolution models. We describe how we scale up the system as a high definition text-to-video model including design decisions such as the choice of fully-convolutional temporal and spatial super-resolution models at certain resolutions, and the choice of the v-parameterization of diffusion models. In addition, we confirm and transfer findings from previous work on diffusion-based image generation to the video generation setting. Finally, we apply progressive distillation to our video models with classifier-free guidance for fast, high quality sampling. We find Imagen Video not only capable of generating videos of high fidelity, but also having a high degree of controllability and world knowledge, including the ability to generate diverse videos and text animations in various artistic styles and with 3D object understanding.

URL: https://openreview.net/forum?id=ETH1GDbSPw

---

Title: Extreme Masking for Learning Instance and Distributed Visual Representations

Abstract: The paper presents a scalable approach for learning distributed visual representations over individual tokens and a holistic instance representation simultaneously. We use self-attention blocks to represent spatially distributed tokens, followed by cross-attention blocks to aggregate the holistic instance. The core of the approach is the use of extremely large token masking (75\%-90\%) as the data augmentation for supervision. Our model, named ExtreMA, follows the plain BYOL approach where the instance representation from the unmasked subset is trained to predict that from the intact input. Learning requires the model to capture informative variations in an instance, instead of encouraging invariance.

The paper makes three contributions: 1) It presents random masking as a strong and computationally efficient data augmentation for learning generalizable attention representations. 2) With multiple sampling per instance, extreme masking greatly speeds up learning and creates hunger for more data. 3) Distributed representations can be learned from instance supervision alone, unlike per-token supervision in masked modeling.
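
A minimal sketch of the extreme token masking used as augmentation above: keep only a random 10-25% subset of patch tokens per view. The ViT-style shapes and the keep ratio are illustrative assumptions:

    import torch

    def mask_tokens(tokens, keep_ratio=0.15):
        B, N, D = tokens.shape
        n_keep = max(1, int(N * keep_ratio))
        # A random subset of token indices per example.
        idx = torch.rand(B, N).argsort(dim=1)[:, :n_keep]
        return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, D))

    tokens = torch.randn(8, 196, 768)   # ViT-style patch tokens
    view = mask_tokens(tokens)          # (8, 29, 768) unmasked subset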

URL: https://openreview.net/forum?id=3epEbhdgbv

---

Title: Cox-Hawkes: doubly stochastic spatiotemporal Poisson processes

Abstract: Hawkes processes are point process models that have been used to capture self-excitatory behavior in social interactions, neural activity, earthquakes and viral epidemics. They can model the occurrence of the times and locations of events. Here we develop a new class of spatiotemporal Hawkes processes that can capture both triggering and clustering behavior and we provide an efficient method for performing inference. We use a log-Gaussian Cox process (LGCP) as prior for the background rate of the Hawkes process which gives arbitrary flexibility to capture a wide range of underlying background effects (for infectious diseases these are called endemic effects). The Hawkes process and LGCP are computationally expensive due to the former having a likelihood with quadratic complexity in the number of observations and the latter involving inversion of the precision matrix which is cubic in observations. Here we propose a novel approach to perform MCMC sampling for our Hawkes process with LGCP background, using pre-trained Gaussian Process generators which provide direct and cheap access to samples during inference. We show the efficacy and flexibility of our approach in experiments on simulated data and use our methods to uncover the trends in a dataset of reported crimes in the US.
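
A toy sketch of the self-exciting intensity underlying such models: the rate at time t is a background term plus a kernel sum over past events. The exponential kernel and constant background are stand-ins; in the paper the background is a log-Gaussian Cox process over space and time:

    import numpy as np

    def hawkes_intensity(t, event_times, mu=0.5, alpha=0.8, beta=1.5):
        # mu: background rate; alpha: branching ratio; beta: decay rate.
        past = event_times[event_times < t]
        return mu + alpha * beta * np.exp(-beta * (t - past)).sum()

    events = np.array([0.3, 1.1, 1.4, 2.9])
    print(hawkes_intensity(3.0, events))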

URL: https://openreview.net/forum?id=xzCDD9i4IZ

---
