# Weekly TMLR digest for Jul 10, 2022

2 views

### TMLR

Jul 9, 2022, 8:00:08 PMJul 9

Accepted papers
===============

Title: How Expressive are Transformers in Spectral Domain for Graphs?

Authors: Anson Bastos, Abhishek Nadgeri, Kuldeep Singh, Hiroki Kanezashi, Toyotaro Suzumura, Isaiah Onando Mulang'

Abstract: The recent works proposing transformer-based models for graphs have proven the inadequacy of Vanilla Transformer for graph representation learning. To understand this inadequacy, there is a need to investigate if spectral analysis of the transformer will reveal insights into its expressive power. Similar studies already established that spectral analysis of Graph neural networks (GNNs) provides extra perspectives on their expressiveness.
In this work, we systematically study and establish the link between the spatial and spectral domain in the realm of the transformer. We further provide a theoretical analysis that the spatial attention mechanism in the transformer cannot effectively capture the desired frequency response, thus, inherently limiting its expressiveness in spectral space. Therefore, we propose FeTA, a framework that aims to perform attention over the entire graph spectrum (i.e. actual frequency components of the graph) analogous to the attention in spatial space.
Empirical results suggest that FeTA provides homogeneous performance gain against vanilla transformer across all tasks on standard benchmarks and can easily be extended to GNN-based models with low-pass characteristics (e.g., GAT).

URL: https://openreview.net/forum?id=aRsLetumx1

---

Title: Robust and Data-efficient Q-learning by Composite Value-estimation

Authors: Gabriel Kalweit, Maria Kalweit, Joschka Boedecker

Abstract: In the past few years, off-policy reinforcement learning methods have shown promising results in their application to robot control. Q-learning based methods, however, still suffer from poor data-efficiency and are susceptible to stochasticity or noise in the immediate reward, which is limiting with regard to real-world applications. We alleviate this problem by proposing two novel off-policy Temporal-Difference formulations: (1) Truncated Q-functions which represent the return for the first $n$ steps of a target-policy rollout with respect to the full action-value and (2) Shifted Q-functions, acting as the farsighted return after this truncated rollout. This decomposition allows us to optimize both parts with their individual learning rates, achieving significant learning speedup and robustness to variance in the reward signal, leading to the Composite Q-learning algorithm. We show the efficacy of Composite Q-learning in the tabular case and furthermore employ Composite Q-learning within TD3. We compare Composite TD3 with TD3 and TD3($\Delta$), which we introduce as an off-policy variant of TD($\Delta$). Moreover, we show that Composite TD3 outperforms TD3 as well as TD3($\Delta$) significantly in terms of data-efficiency in multiple simulated robot tasks and that Composite Q-learning is robust to stochastic immediate rewards.

URL: https://openreview.net/forum?id=ak6Bds2DcI

---

Title: Iterative State Estimation in Non-linear Dynamical Systems Using Approximate Expectation Propagation

Authors: Sanket Kamthe, So Takao, Shakir Mohamed, Marc Peter Deisenroth

Abstract: Bayesian inference in non-linear dynamical systems seeks to find good posterior approximations of a latent state given a sequence of observations. Gaussian filters and smoothers, including the (extended/unscented) Kalman filter/smoother, which are commonly used in engineering applications, yield Gaussian posteriors on the latent state. While they are computationally efficient, they are often criticised for their crude approximation of the posterior state distribution. In this paper, we address this criticism by proposing a message passing scheme for iterative state estimation in non-linear dynamical systems, which yields more informative (Gaussian) posteriors on the latent states. Our message passing scheme is based on expectation propagation (EP). We prove that classical Rauch--Tung--Striebel (RTS) smoothers, such as the extended Kalman smoother (EKS) or the unscented Kalman smoother (UKS), are special cases of our message passing scheme. Running the message passing scheme more than once can lead to significant improvements of the classical RTS smoothers, so that more informative state estimates can be obtained. We address potential convergence issues of EP by generalising our state estimation framework to damped updates and the consideration of general $\alpha$-divergences.

URL: https://openreview.net/forum?id=xyt4wfdo4J

---

New submissions
===============

Title: Transfer Learning for Segmentation Problems: Choose the Right Encoder and Skip the Decoder

Abstract: It is common practice to reuse models initially trained on different data to increase downstream task performance. Especially in the computer vision domain, ImageNet-pretrained weights have been successfully used for various tasks. In this work, we investigate the impact of transfer learning for segmentation problems, being pixel-wise classification problems that can be tackled with encoder-decoder architectures. We find that transfer learning the decoder does not help downstream segmentation tasks, while transfer learning the encoder is truly beneficial.
We demonstrate that pretrained weights for a decoder may yield faster convergence, but they do not improve the overall model performance as one can obtain equivalent results with randomly initialized decoders.
However, we show that it is more effective to reuse encoder weights trained on a segmentation or reconstruction task than reusing encoder weights trained on classification tasks. This finding implicates that using ImageNet-pretrained encoders for downstream segmentation problems is suboptimal. We also propose a contrastive self-supervised approach with multiple self-reconstruction tasks, which provides encoders that are suitable for transfer learning in segmentation problems in the absence of segmentation labels.

URL: https://openreview.net/forum?id=A3u8KqSBC7

---

Title: Convolution based Variational Bayes for Patient Vital Signs Modeling with Factorial HMM

Abstract: We propose a novel convolution based variational distribution and an EM based learning algorithm to scale factorial HMM to long and complex time series. The number of trainable parameters in our model is independent from the length of the input data. Our model is
also adapted to the use of arbitrarily complex state emission distribution and can therefore be used in combination with patient physiological models. We show the ability of our model to disentangle independent additive processes from synthetic data. Our experiments also confirm that our algorithm is able to fit real world patient data more accurately when several independent Markov chains are used compared to a single Markov chain with a larger state space. Our model could thus offer a scalable, interpretable and versatile alternative to latent space time series models such as standard HMM.

URL: https://openreview.net/forum?id=aSTsODJ3On

---

Title: Retained Singular Values in Probabilistic Image Segmentation with Normalizing Flows and Optimal Transport

Abstract: Latent probabilistic models are a popular choice for quantifying aleatoric uncertainty in image segmentation tasks. Nevertheless, we find that the singular values of the model distributions can vanish and result in a poor latent space. The retainment of latent singular values has been successful in state-of-the-art deterministic self-supervised models by optimizing embeddings on a projected space. In this work, we extend this approach to the probabilistic setting by introducing the Conditional Sinkhorn Auto-encoder (cSAE). It is shown that with Normalizing Flows and Optimal Transport theory, we can project the latent space and improve the learned embeddings of supervised conditional probabilistic segmentation models. This is because the singular values of the learned Normal densities are better retained, thereby improving the ability to accurately quantify the data uncertainty.

URL: https://openreview.net/forum?id=TYliwBILJX

---

Title: GFNet: Geometric Flow Network for 3D Point Cloud Semantic Segmentation

Abstract: Point cloud semantic segmentation from projected views, such as range-view (RV) and bird's-eye-view (BEV), has been intensively investigated. Different views capture different information of point clouds and thus are complementary to each other. However, recent projection-based methods for point cloud semantic segmentation usually utilize a vanilla late fusion strategy for the predictions of different views, failing to explore the complementary information from a geometric perspective during the representation learning. In this paper, we introduce a geometric flow network (GFNet) to explore the geometric correspondence between different views in an align-before-fuse manner. Specifically, we devise a novel geometric flow module (GFM) to bidirectionally align and propagate the complementary information across different views according to geometric relationships under the end-to-end learning scheme. We perform extensive experiments on two widely used benchmark datasets, SemanticKITTI and nuScenes, to demonstrate the effectiveness of our GFNet for project-based point cloud semantic segmentation. Concretely, GFNet not only significantly boosts the performance of each individual view but also achieves state-of-the-art results over all existing projection-based models. Code is available in the supplementary material.

URL: https://openreview.net/forum?id=LSAAlS7Yts

---

Title: On the Near-Optimality of Local Policies in Large Cooperative Multi-Agent Reinforcement Learning

Abstract: We show that in a cooperative $N$-agent network, one can design locally executable policies for the agents such that the resulting discounted sum of average rewards (value) well approximates the optimal value computed over all (including non-local) policies. Specifically, we prove that, if $|\mathcal{X}|, |\mathcal{U}|$ denote the size of state, and action spaces of individual agents, then the approximation error is given by $\mathcal{O}(e)$ where $e\triangleq \frac{1}{\sqrt{N}}\left[\sqrt{|\mathcal{X}|}+\sqrt{|\mathcal{U}|}\right]$. Moreover, in a special case where the reward and state transition functions are independent of the action distribution of the population, the error improves to $\mathcal{O}(e)$ where $e\triangleq \frac{1}{\sqrt{N}}\sqrt{|\mathcal{X}|}$. Finally, we also devise an algorithm to explicitly construct a local policy. With the help of our approximation results, we further establish that the constructed local policy is within $\mathcal{O}(\max\{e,\epsilon\})$ distance of the optimal policy, and the sample complexity to achieve such a local policy is $\mathcal{O}(\epsilon^{-3})$, for any $\epsilon>0$.

URL: https://openreview.net/forum?id=t5HkgbxZp1

---

Title: Encoding Hierarchical Information in Neural Networks \\helps in Subpopulation Shift

Abstract: Over the past decade, deep neural networks have proven to be adept in image classification tasks, often surpassing humans in terms of accuracy. However, standard neural networks often fail to understand the concept of hierarchical structures and dependencies among different classes for vision related tasks. Humans on the other hand, seem to intuitively learn categories conceptually, progressively growing from understanding high-level concepts down to granular levels of categories. One of the issues arising from the inability of neural networks to encode such dependencies within its learned structure is that of subpopulation shift -- where models are queried with novel unseen classes taken from a shifted population of the training set categories. Since the neural network treats each class as independent from all others, it struggles to categorize shifting populations that are dependent at higher levels of the hierarchy. In this work, we study the aforementioned problems through the lens of a novel conditional supervised training framework. We tackle subpopulation shift by a structured learning procedure that incorporates hierarchical information conditionally through labels. Furthermore, we introduce a notion of graphical distance to model the catastrophic effect of mispredictions. We show that learning in this structured hierarchical manner results in networks that are more robust against subpopulation shifts, with an improvement up to ~3 % in terms of accuracy and up to 11 % in terms of graphical distance over standard models on subpopulation shift benchmarks.

URL: https://openreview.net/forum?id=t0L4xG4aGC

---

Title: Counterfactual Learning with Multioutput Deep Kernels

Abstract: In this paper, we address the challenge of performing counterfactual inference with observational data via Bayesian nonparametric regression adjustment, with a focus on high-dimensional settings featuring multiple actions and multiple correlated outcomes. We present a general class of counterfactual multi-task deep kernels models that estimate causal effects and learn policies proficiently thanks to their sample efficiency gains, while scaling well with high dimensions. In the first part of the work, we rely on Structural Causal Models (SCM) to formally introduce the setup and the problem of identifying counterfactual quantities under observed confounding. We then discuss the benefits of tackling the task of causal effects estimation via stacked coregionalized Gaussian Processes and Deep Kernels. Finally, we demonstrate the use of the proposed methods on simulated experiments that span individual causal effects estimation, off-policy evaluation and optimization.

URL: https://openreview.net/forum?id=iGREAJdULX

---

Title: ANCER: Anisotropic Certification via Sample-wise Volume Maximization

Abstract: Randomized smoothing has recently emerged as an effective tool that enables certification of deep neural network classifiers at scale. All prior art on randomized smoothing has focused on isotropic $\ell_p$ certification, which has the advantage of yielding certificates that can be easily compared among isotropic methods via $\ell_p$-norm radius. However, isotropic certification limits the region that can be certified around an input to worst-case adversaries, i.e., it cannot reason about other "close", potentially large, constant prediction safe regions. To alleviate this issue, (i) we theoretically extend the isotropic randomized smoothing $\ell_1$ and $\ell_2$ certificates to their generalized anisotropic counterparts following a simplified analysis. Moreover, (ii) we propose evaluation metrics allowing for the comparison of general certificates - a certificate is superior to another if it certifies a superset region - with the quantification of each certificate through the volume of the certified region. We introduce ANCER, a framework for obtaining anisotropic certificates for a given test set sample via volume maximization. We achieve it by generalizing memory-based certification of data-dependent classifiers. Our empirical results demonstrate that ANCER achieves state-of-the-art $\ell_1$ and $\ell_2$ certified accuracy on CIFAR-10 and ImageNet in the data-dependence setting, while certifying larger regions in terms of volume, highlighting the benefits of moving away from isotropic analysis.

URL: https://openreview.net/forum?id=7j0GI6tPYi

---

Title: Do better ImageNet classifiers assess perceptual similarity better?

Abstract: Perceptual distances between images, as measured in the space of pre-trained deep features, have outperformed prior low-level, pixel-based metrics on assessing image similarity. While the capabilities of older and less accurate models such as AlexNet and VGG to capture perceptual similarity are well known, modern and more accurate models are less studied. In this paper, we present a large-scale empirical study to assess how well ImageNet classifiers perform on perceptual similarity. First, we observe a inverse correlation between ImageNet accuracy and Perceptual Scores of modern networks such as ResNets, EfficientNets, and Vision Transformers: that is better classifiers achieve worse Perceptual Scores. Then, we examine the ImageNet accuracy/Perceptual Score relationship on varying the depth, width, number of training steps, weight decay, label smoothing, and dropout. Higher accuracy improves Perceptual Score up to a certain point, but we uncover a Pareto frontier between accuracies and Perceptual Score in the mid-to-high accuracy regime. We explore this relationship further using a number of plausible hypotheses such as distortion invariance, spatial frequency sensitivity, and alternative perceptual functions. Interestingly we discover shallow ResNets, trained for less than 5 epochs only on ImageNet, whose emergent Perceptual Score matches the prior best networks trained directly on supervised human perceptual judgements.

URL: https://openreview.net/forum?id=qrGKGZZvH0

---

Title: Image-graph-image auto-encoding: a conceptual study with symbolic shape classification

Abstract: This work presents basic research on convolutional neural networks that learn to predict explainable scene graphs from input images without external supervision during training. Unlike existing approaches following a fully-supervised training paradigm, thereby requiring meticulous annotations, we are the first to present a self-supervised approach based on a fully differentiable auto-encoder in which the bottleneck is the graph that corresponds to the input image. To demonstrate the unique conceptual properties of our graph auto-encoder, we apply it to an example task that performs simple rule-based shape classification using only the information in the graph, and we show that our approach allows for the successful classification of shapes that are never seen during training. We report exploratory findings of our research in which the presented approach is applied to elementary line drawings depicting single shapes with limited complexity. We show that our approach exhibits comparable performance to a fully-supervised graph parser baseline, and generalizes significantly better than a conventional image classifier. Although extensive future research is needed to bring our approach to complex natural images, we believe it makes a valuable conceptual step in bridging deep neural networks with graph-based symbolic knowledge representations.

URL: https://openreview.net/forum?id=gQxtrFKgFQ

---

Title: Nonstationary Reinforcement Learning with Linear Function Approximation

Abstract: We consider reinforcement learning (RL) in episodic Markov decision processes (MDPs) with linear function approximation under drifting environment. Specifically, both the reward and state transition functions can evolve over time but their total variations do not exceed a \textit{variation budget}. We first develop $\texttt{LSVI-UCB-Restart}$ algorithm, an optimistic modification of least-squares value iteration with periodic restart, and bound its dynamic regret when variation budgets are known. Then we propose a parameter-free algorithm \texttt{Ada-LSVI-UCB-Restart} that extends to unknown variation budgets. We also derive the first minimax dynamic regret lower bound for nonstationary linear MDPs and as a byproduct establish a minimax regret lower bound for linear MDPs unsolved by Jin et al. (2020). Finally, we provide numerical experiments to demonstrate the effectiveness of our proposed algorithms.

URL: https://openreview.net/forum?id=nS8A9nOrqp

---

Title: Learning to correct spectral methods for simulating turbulent flows

Abstract: Despite their ubiquity throughout science and engineering, only a handful of partial differential equations (PDEs) have analytical, or closed-form solutions. This motivates a vast amount of classical work on numerical simulation of PDEs and more recently, a whirlwind of research into data-driven techniques leveraging machine learning (ML). A recent line of work indicates that a hybrid of classical numerical techniques with machine learning can offer significant improvements over either approach alone. In this work, we show that the choice of the numerical scheme is crucial when incorporating physics-based priors. We build upon Fourier-based spectral methods, which are considerably more efficient than other numerical schemes for simulating PDEs with smooth and periodic solutions. Specifically, we develop ML-augmented spectral solvers for three model PDEs of fluid dynamics, which improve upon the accuracy of standard spectral solvers at the same resolution. We also demonstrate a handful of key design principles for combining machine learning and numerical methods for solving PDEs.

URL: https://openreview.net/forum?id=vIc8P32GRG

---

Title: Learning Algorithms for Markovian Bandits:\\Is Posterior Sampling more Scalable than Optimism?

Abstract: In this paper, we study the scalability of model-based algorithms learning the optimal policy of a discounted markovian bandit problem with $n$ arms. There are two categories of model-based reinforcement learning algorithms: bayesian algorithms (like PSRL), and optimistic algorithms (like UCRL2 or UCBVI). While a naive application of these algorithms is not scalable because the state-space is exponential in $n$, we construct variants specially tailored to markovian bandits (MB) that we call MB-PSRL, MB-UCRL2, and MB-UCBVI. They all have a low regret in $\tilde{O}(S\sqrt{nK})$ -- where $K$ is the number of episodes, $n$ is the number of arms and $S$ is the number of states of each arm. Up to a factor $\sqrt{S}$, these regrets match the lower bound of $\Omega(\sqrt{SnK})$ that we also derive.

Even if their theoretical regrets are comparable, the {\it time complexity} of these algorithms varies greatly: We show that MB-UCRL2, as well as all algorithms that use bonuses on transition matrices have a { time} complexity that grows exponentially in $n$. In contrast, MB-UCBVI does not use bonuses on transition matrices and we show that it can be implemented efficiently, with a time complexity linear in $n$. However, our numerical experiments show that its empirical regret is large. Finally, our bayesian algorithm, MB-PSRL, enjoys the best of both worlds: its running time is linear in the number of arms and its empirical regret is the smallest of all algorithms. This is a new confirmation of the power of bayesian algorithms, that can often be easily tailored to the structure of the problems to learn.

URL: https://openreview.net/forum?id=Sh3RF9JowK

---

Title: CoCa: Contrastive Captioners are Image-Text Foundation Models

Abstract: Exploring large-scale pretrained foundation models is of significant interest in computer vision because these models can be quickly transferred to many downstream tasks. This paper presents Contrastive Captioner (CoCa), a minimalist design to pretrain an image-text encoder-decoder foundation model jointly with contrastive loss and captioning loss, thereby subsuming model capabilities from contrastive approaches like CLIP and generative methods like SimVLM. In contrast to standard encoder-decoder transformers where all decoder layers attend to encoder outputs, CoCa omits cross-attention in the first half of decoder layers to encode unimodal text representations, and cascades the remaining decoder layers which cross-attend to the image encoder for multimodal image-text representations. We apply a contrastive loss between unimodal image and text embeddings, in addition to a captioning loss on the multimodal decoder outputs which predicts text tokens autoregressively. By sharing the same computational graph, the two training objectives are computed efficiently with minimal overhead. CoCa is pretrained end-to-end and from scratch on both web-scale alt-text data and annotated images by treating all labels simply as text, seamlessly unifying natural language supervision for representation learning. Empirically, CoCa achieves state-of-the-art performance with zero-shot transfer or minimal task-specific adaptation on a broad range of downstream tasks, spanning visual recognition (ImageNet, Kinetics-400/600/700, Moments-in-Time), crossmodal retrieval (MSCOCO, Flickr30K, MSR-VTT), multimodal understanding (VQA, SNLI-VE, NLVR2), and image captioning (MSCOCO, NoCaps). Notably on ImageNet classification, CoCa obtains 86.3% zero-shot top-1 accuracy, 90.6% with a frozen encoder and learned classification head, and 91.0% with a finetuned encoder.

URL: https://openreview.net/forum?id=Ee277P3AYC

---

Title: Can You Win Everything with A Lottery Ticket?

Abstract: $\textit{Lottery ticket hypothesis}$ (LTH) has demonstrated to yield independently trainable and highly sparse neural networks (a.k.a. $\textit{winning tickets}$), whose test set accuracies can be surprisingly on par or even better than dense models. However, accuracy is far from the only evaluation metric, and perhaps not always the most important one. Hence it might be myopic to conclude that a sparse subnetwork can replace its dense counterpart, even if the accuracy is preserved. Spurred by that, we perform the first comprehensive assessment of lottery tickets from diverse aspects beyond test accuracy, including $\textit{(i)}$ generalization to distribution shifts, $\textit{(ii)}$ prediction uncertainty, $\textit{(iii)}$ interpretability, and $\textit{(iv)}$ geometry of loss landscapes. With extensive experiments across datasets {CIFAR-10, CIFAR-100, and ImageNet}, model architectures, as well as tens of sparsification methods, we thoroughly characterize the trade-off between model sparsity and the all-dimension model capabilities. We find that an appropriate sparsity (e.g., $20\%\sim99.08\%$) can yield the winning ticket to perform comparably or even better $\textbf{in all above four aspects}$, although some aspects (generalization to certain distribution shifts, and uncertainty) appear more sensitive to the sparsification than others. We term it as a $\texttt{LTH-PASS}$. Overall, our results endorse choosing a good sparse subnetwork of a larger dense model, over directly training a small dense model of similar parameter counts. We hope that our study can offer more in-depth insights on pruning, for researchers and engineers who seek to incorporate sparse neural networks for user-facing deployments. Codes are provided in the supplement.

URL: https://openreview.net/forum?id=JL6MU9XFzW

---

Title: Deep Policies for Online Bipartite Matching: A Reinforcement Learning Approach

Abstract: The challenge in the widely applicable online matching problem lies in making irrevocable assignments while there is uncertainty about future inputs. Most theoretically-grounded policies are myopic or greedy in nature. In real-world applications where the matching process is repeated on a regular basis, the underlying data distribution can be leveraged for better decision-making. We present an end-to-end Reinforcement Learning framework for deriving better matching policies based on trial-and-error on historical data. We devise a set of neural network architectures, design feature representations, and empirically evaluate them across two online matching problems: Edge-Weighted Online Bipartite Matching and Online Submodular Bipartite Matching. We show that most of the learning approaches perform consistently better than classical baseline algorithms on four synthetic and real-world datasets. On average, our proposed models improve the matching quality by 3-10% on a variety of synthetic and real-world datasets.

URL: https://openreview.net/forum?id=mbwm7NdkpO

---

Title: Towards Accurate Subgraph Similarity Computation via Neural Graph Pruning

Abstract: Subgraph similarity search, one of the core problems in graph search, concerns whether a target graph approximately contains a query graph. The problem is recently touched by neural methods. However, current neural methods do not consider pruning the target graph, though pruning is critically important in traditional calculations of subgraph similarities. One obstacle to applying pruning in neural methods is the non-differentiable property of the operation. In this work, we convert graph pruning to a problem of node relabeling and then relax it to a differentiable problem. Based on this idea, we further design a novel neural network to approximate a type of subgraph distance: the subgraph edit distance (SED). Our pruning component is differentiable, so the entire model can be optimized end-to-end. In the design of the model, we propose an attention mechanism to leverage the information about the query graph and guide the pruning of the target graph. Moreover, we develop a multi-head pruning strategy such that the model can better explore multiple ways of pruning the target graph. The proposed model establishes new state-of-the-art results across seven benchmark datasets. Extensive analysis of the model indicates that the proposed model can reasonably prune the target graph for SED computation.

URL: https://openreview.net/forum?id=CfzIsWWBlo

---

Title: Non-stationary Contextual Pricing with Safety Constraints

Abstract: In a contextual pricing problem, a seller aims at maximizing the revenue over a sequence of sales sessions (described by feature vectors) using binary-censored feedback of "sold" or "not sold". Existing methods often overlook two practical challenges (1) the best pricing strategy could change over time; (2) the prices and pricing policies must conform to hard constraints due to safety, ethical or legal restrictions. We address both challenges by solving a more general problem of universal dynamic regret minimization in proper online learning with exp-concave losses --- an open problem posed by Baby & Wang (2021) that we partially resolve in this paper. Here "dynamic regret" measures the performance relative to a non-stationary sequence of policies, and "proper" means that the learner must choose feasible strategies within a pre-defined convex set, which we use to model the safety constraints. In the case of a known log-concave market noise, our algorithm achieves $\tilde{O}(d^3T^{1/3}C_T^{2/3} \vee d^3)$ dynamic regret that is optimal w.r.t. $T$ and $C_T$, adapts to unknown non-stationarity $C_T$ and remains feasible throughout. We also report other results under weaker assumptions. To the best of our knowledge, we are the first to obtain provable guarantees in non-stationary contextual pricing.

URL: https://openreview.net/forum?id=fWIQ9Oaao0

---

Title: Low-regret Active Learning

Abstract: We develop an online learning algorithm for identifying unlabeled data points that are most informative for training (i.e., active learning). By formulating the active learning problem as the prediction with sleeping experts problem, we provide a regret minimization framework for identifying relevant data with respect to any given definition of informativeness. Motivated by the successes of ensembles in active learning, we define regret with respect to an omnipotent algorithm that has access to an infinity large ensemble. At the core of our work is an efficient algorithm for sleeping experts that is tailored to achieve low regret on easy instances while remaining resilient to adversarial ones. Low regret implies that we can be provably competitive with an ensemble method without the computational burden of having to train an ensemble. This stands in contrast to state-of-the-art active learning methods that are overwhelmingly based on greedy selection, and hence cannot ensure good performance across problem instances with high amounts of noise. We present empirical results demonstrating that our method (i) instantiated with an informativeness measure consistently outperforms its greedy counterpart and (ii) reliably outperforms uniform sampling on real-world scenarios.

URL: https://openreview.net/forum?id=3dSZUYjPwN

---

Title: Concentration inequalities and optimal number of layers for stochastic deep neural networks

Abstract: We state concentration and martingale inequalities for the output of the hidden layers of a stochastic deep neural network (SDNN), as well as for the output of the whole SDNN. These results allow us to introduce an expected classifier (EC), and to give probabilistic upper bound for the classification error of the EC. We also state the optimal number of layers for the SDNN via an optimal stopping procedure. We apply our analysis to a stochastic version of a feedforward neural network with ReLU activation function.

URL: https://openreview.net/forum?id=7Ucy1IQo8V

---

Title: Direct Molecular Conformation Generation

Abstract: Molecular conformation generation aims to generate three-dimensional coordinates of all the atoms in a molecule and is an important task in bioinformatics and pharmacology. Previous methods usually first predict the interatomic distances, the gradients of interatomic distances or the local structures (e.g., torsion angles) of a molecule, and then reconstruct its 3D conformation. How to directly generate the conformation without the above intermediate values is not fully explored. In this work, we propose a method that directly predicts the coordinates of atoms: (1) the loss function is invariant to roto-translation of coordinates and permutation of symmetric atoms; (2) the newly proposed model adaptively aggregates the bond and atom information and iteratively refines the coordinates of the generated conformation. Our method achieves the best results on GEOM-QM9 and GEOM-Drugs datasets. Further analysis shows that our generated conformations have closer properties (e.g., HOMO-LUMO gap) with the groundtruth conformations. In addition, our method improves molecular docking by providing better initial conformations. All the results demonstrate the effectiveness of our method and the great potential of the direct approach. The code is released at \url{https://github.com/DirectMolecularConfGen/DMCG}.

URL: https://openreview.net/forum?id=lCPOHiztuw

---