Weekly TMLR digest for Oct 22, 2023


TMLR

Oct 21, 2023, 8:00:09 PM
to tmlr-annou...@googlegroups.com


New certifications
==================

Featured Certification: Inverse Scaling: When Bigger Isn't Better

Ian R. McKenzie, Alexander Lyzhov, Michael Martin Pieler, Alicia Parrish, Aaron Mueller, Ameya Prabhu, Euan McLean, Xudong Shen, Joe Cavanagh, Andrew George Gritsevskiy, Derik Kauffman, Aaron T. Kirtland, Zhengping Zhou, Yuhui Zhang, Sicong Huang, Daniel Wurgaft, Max Weiss, Alexis Ross, Gabriel Recchia, Alisa Liu, Jiacheng Liu, Tom Tseng, Tomasz Korbak, Najoung Kim, Samuel R. Bowman, Ethan Perez

https://openreview.net/forum?id=DwgRm72GQF

---


Accepted papers
===============


Title: Policy Gradient Algorithms Implicitly Optimize by Continuation

Authors: Adrien Bolland, Gilles Louppe, Damien Ernst

Abstract: Direct policy optimization in reinforcement learning is usually solved with policy-gradient algorithms, which optimize policy parameters via stochastic gradient ascent. This paper provides a new theoretical interpretation and justification of these algorithms. First, we formulate direct policy optimization in the optimization by continuation framework. The latter is a framework for optimizing nonconvex functions where a sequence of surrogate objective functions, called continuations, are locally optimized. Second, we show that optimizing affine Gaussian policies and performing entropy regularization can be interpreted as implicitly optimizing deterministic policies by continuation. Based on these theoretical results, we argue that exploration in policy-gradient algorithms consists in computing a continuation of the return of the policy at hand, and that the variance of policies should be history-dependent functions adapted to avoid local extrema rather than to maximize the return of the policy.
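
For orientation, one way to picture the continuation view in our own notation (not the paper's): a Gaussian policy whose mean is a deterministic policy $\mu_\theta$ defines a smoothed surrogate of the deterministic return,

$$
J_\sigma(\theta) \;=\; \mathbb{E}\!\left[\sum_{t} \gamma^{t}\, r(s_t, a_t) \;\middle|\; a_t \sim \mathcal{N}\big(\mu_\theta(s_t),\, \sigma^2 I\big)\right],
\qquad
J_\sigma(\theta) \xrightarrow[\;\sigma \to 0\;]{} J(\mu_\theta).
$$

In this reading, the family $\{J_\sigma\}_{\sigma>0}$ plays the role of the continuations, and the history-dependent variance advocated in the abstract corresponds to adapting $\sigma$ along the optimization path rather than choosing it to maximize return.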

URL: https://openreview.net/forum?id=3Ba6Hd3nZt

---

Title: Inverse Scaling: When Bigger Isn't Better

Authors: Ian R. McKenzie, Alexander Lyzhov, Michael Martin Pieler, Alicia Parrish, Aaron Mueller, Ameya Prabhu, Euan McLean, Xudong Shen, Joe Cavanagh, Andrew George Gritsevskiy, Derik Kauffman, Aaron T. Kirtland, Zhengping Zhou, Yuhui Zhang, Sicong Huang, Daniel Wurgaft, Max Weiss, Alexis Ross, Gabriel Recchia, Alisa Liu, Jiacheng Liu, Tom Tseng, Tomasz Korbak, Najoung Kim, Samuel R. Bowman, Ethan Perez

Abstract: Work on scaling laws has found that large language models (LMs) show predictable improvements to overall loss with increased scale (model size, training data, and compute). Here, we present evidence for the claim that LMs may show inverse scaling, or worse task performance with increased scale, e.g., due to flaws in the training objective and data. We present empirical evidence of inverse scaling on 11 datasets collected by running a public contest, the Inverse Scaling Prize, with a substantial prize pool. Through analysis of the datasets, along with other examples found in the literature, we identify four potential causes of inverse scaling:
(i) preference to repeat memorized sequences over following in-context instructions,
(ii) imitation of undesirable patterns in the training data,
(iii) tasks containing an easy distractor task which LMs could focus on, rather than the harder real task, and
(iv) correct but misleading few-shot demonstrations of the task.
We release the winning datasets at inversescaling.com/data to allow for further investigation of inverse scaling. Our tasks have helped drive the discovery of U-shaped and inverted-U scaling trends, where an initial trend reverses, suggesting that scaling trends are less reliable at predicting the behavior of larger-scale models than previously understood. Overall, our results suggest that there are tasks for which increased model scale alone may not lead to progress, and that more careful thought needs to go into the data and objectives for training language models.

URL: https://openreview.net/forum?id=DwgRm72GQF

---

Title: Self-Attention in Colors: Another Take on Encoding Graph Structure in Transformers

Authors: Romain Menegaux, Emmanuel Jehanno, Margot Selosse, Julien Mairal

Abstract: We introduce a novel self-attention mechanism, called CSA (Chromatic Self-Attention), which extends the notion of attention scores to attention _filters_, independently modulating the feature channels. We showcase CSA in a fully-attentional graph Transformer, CGT (Chromatic Graph Transformer), which integrates both graph structural information and edge features, completely bypassing the need for local message-passing components. Our method flexibly encodes graph structure through node-node interactions, by enriching the original edge features with a relative positional encoding scheme. We propose a new scheme based on random walks that encodes both structural and positional information, and show how to incorporate higher-order topological information, such as rings in molecular graphs. Our approach achieves state-of-the-art results on the ZINC benchmark dataset, while providing a flexible framework for encoding graph structure and incorporating higher-order topology.
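
A minimal sketch of the per-channel attention idea, assuming dense (n, d) node features and an (n, n, d) per-channel edge encoding; the actual CSA normalization and edge-feature construction may differ:

```python
import torch

def chromatic_self_attention(x, edge_bias, Wq, Wk, Wv):
    # x: (n, d) node features; edge_bias: (n, n, d) per-channel relative encoding
    q, k, v = x @ Wq, x @ Wk, x @ Wv                       # (n, d) each
    # one attention score per feature channel (a "filter") instead of a scalar per pair
    scores = q.unsqueeze(1) * k.unsqueeze(0) + edge_bias   # (n, n, d)
    filt = torch.softmax(scores / x.shape[-1] ** 0.5, dim=1)
    return (filt * v.unsqueeze(0)).sum(dim=1)              # (n, d) updated node features
```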

URL: https://openreview.net/forum?id=3dQCNqqv2d

---

Title: Identifying latent distances with Finslerian geometry

Authors: Alison Pouplin, David Eklund, Carl Henrik Ek, Søren Hauberg

Abstract: Riemannian geometry provides us with powerful tools to explore the latent space of generative models while preserving the underlying structure of the data. The latent space can be equipped with a Riemannian metric pulled back from the data manifold. With this metric, we can systematically navigate the space relying on geodesics defined as the shortest curves between two points.

Generative models are often stochastic, causing the data space, the Riemannian metric, and the geodesics, to be stochastic as well. Stochastic objects are at best impractical, and at worst impossible, to manipulate. A common solution is to approximate the stochastic pullback metric by its expectation. But the geodesics derived from this expected Riemannian metric do not correspond to the expected length-minimising curves.

In this work, we propose another metric whose geodesics explicitly minimise the expected length of the pullback metric. We show this metric defines a Finsler metric, and we compare it with the expected Riemannian metric. In high dimensions, we prove that both metrics converge to each other at a rate of $\mathcal{O}\left(\frac{1}{D}\right)$. This convergence implies that the established expected Riemannian metric is an accurate approximation of the theoretically more grounded Finsler metric. This provides justification for using the expected Riemannian metric for practical implementations.
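
For orientation (our notation, not the paper's): for a stochastic decoder $f$ with Jacobian $J_f(z)$, the pullback metric and curve length are

$$
M(z) = J_f(z)^{\top} J_f(z), \qquad
\ell(\gamma) = \int_0^1 \sqrt{\dot\gamma(t)^{\top} M(\gamma(t))\, \dot\gamma(t)}\;\mathrm{d}t .
$$

The comparison above is then between geodesics of the expected metric $\mathbb{E}[M(z)]$ and minimizers of the expected length, which give rise to the Finsler structure $F(z, v) = \mathbb{E}\big[\sqrt{v^{\top} M(z)\, v}\big] \le \sqrt{v^{\top}\, \mathbb{E}[M(z)]\, v}$, the inequality following from Jensen's inequality.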

URL: https://openreview.net/forum?id=Q2Gi0TUAdS

---

Title: Discretization Invariant Networks for Learning Maps between Neural Fields

Authors: Clinton Wang, Polina Golland

Abstract: With the emergence of powerful representations of continuous data in the form of neural fields, there is a need for discretization invariant learning: an approach for learning maps between functions on continuous domains without being sensitive to how the function is sampled. We present a new framework for understanding and designing discretization invariant neural networks (DI-Nets), which generalizes many discrete networks such as convolutional neural networks as well as continuous networks such as neural operators. Our analysis establishes upper bounds on the deviation in model outputs under different finite discretizations, and highlights the central role of point set discrepancy in characterizing such bounds. This insight leads to the design of a family of neural networks driven by numerical integration via quasi-Monte Carlo sampling with discretizations of low discrepancy. We prove by construction that DI-Nets universally approximate a large class of maps between integrable function spaces, and show that discretization invariance also describes backpropagation through such models. Applied to neural fields, convolutional DI-Nets can learn to classify and segment visual data under various discretizations, and sometimes generalize to new types of discretizations at test time.
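
A minimal sketch of the quasi-Monte Carlo quadrature that underpins such layers, assuming a function and kernel defined on the unit cube (the names qmc_integral_layer, kernel and f are our illustrative assumptions, not the paper's API):

```python
import numpy as np
from scipy.stats import qmc

def qmc_integral_layer(f, kernel, x_out, n=256, d=2, seed=0):
    """Approximate (kernel * f)(x_out), i.e. the integral of kernel(x_out - u) * f(u)
    over the unit cube, using a low-discrepancy (Sobol) discretization."""
    pts = qmc.Sobol(d=d, scramble=True, seed=seed).random(n)   # low-discrepancy sample
    vals = np.array([kernel(x_out - u) * f(u) for u in pts])
    return vals.mean(axis=0)                                   # QMC average ~ integral

# Usage (toy): qmc_integral_layer(lambda u: u.sum(), lambda r: np.exp(-np.dot(r, r)), np.zeros(2))
```

Because the layer depends on the sample only through such an average, the deviation between its outputs under two different discretizations is controlled by the discrepancy of the point sets, which is the invariance notion discussed above.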

URL: https://openreview.net/forum?id=CpYBAqDgmz

---

Title: Multi-label Node Classification On Graph-Structured Data

Authors: Tianqi Zhao, Thi Ngan Dong, Alan Hanjalic, Megha Khosla

Abstract: Graph Neural Networks (GNNs) have shown state-of-the-art improvements in node classification tasks on graphs. While these improvements have been largely demonstrated in a multi-class classification scenario, a more general and realistic scenario in which each node could have multiple labels has so far received little attention. The first challenge in conducting focused studies on multi-label node classification is the limited number of publicly available multi-label graph datasets. Therefore, as our first contribution, we collect and release three real-world biological datasets and develop a multi-label graph generator to generate datasets with tunable properties. While high label similarity (high homophily) is usually attributed to the success of GNNs, we argue that a multi-label scenario does not follow the usual semantics of homophily and heterophily so far defined for a multi-class scenario. As our second contribution, we define homophily and Cross-Class Neighborhood Similarity for the multi-label scenario and provide a thorough analysis of the $9$ collected multi-label datasets. Finally, we perform a large-scale comparative study with $8$ methods and $9$ datasets and analyse the performance of the methods to assess the progress made by current state of the art in the multi-label node classification scenario. We release our benchmark at https://github.com/Tianqi-py/MLGNC.
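
As one simple illustration of what a multi-label homophily measure can look like (a Jaccard-over-edges variant of our own; the paper's definitions of homophily and Cross-Class Neighborhood Similarity may differ):

```python
def multilabel_homophily(edges, labels):
    """Average Jaccard similarity between the label sets of connected nodes."""
    sims = []
    for u, v in edges:
        a, b = set(labels[u]), set(labels[v])
        if a or b:
            sims.append(len(a & b) / len(a | b))
    return sum(sims) / len(sims) if sims else 0.0

# Usage: multilabel_homophily([(0, 1), (1, 2)], {0: [0, 2], 1: [2], 2: [1]})
```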

URL: https://openreview.net/forum?id=EZhkV2BjDP

---

Title: Physics informed neural networks for elliptic equations with oscillatory differential operators

Authors: Arnav Gangal, Luis Kim, Sean Patrick Carney

Abstract: Physics informed neural network (PINN) based solution methods for differential equations have recently shown success in a variety of scientific computing applications. Several authors have reported difficulties, however, when using PINNs to solve equations with multiscale features. The objective of the present work is to illustrate and explain the difficulty of using standard PINNs for the particular case of divergence-form elliptic partial differential equations (PDEs) with oscillatory coefficients present in the differential operator. We show that if the coefficient in the elliptic operator $a^{\epsilon}(x)$ is of the form $a(x/\epsilon)$ for a 1-periodic coercive function $a(\cdot)$, then the Frobenius norm of the neural tangent kernel (NTK) matrix associated to the loss function grows as $1/\epsilon^2$. This implies that as the separation of scales in the problem increases, training the neural network with gradient descent based methods to achieve an accurate approximation of the solution to the PDE becomes increasingly difficult. Numerical examples illustrate the stiffness of the optimization problem.
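
For concreteness, the one-dimensional instance of this setting together with the standard PINN residual loss (our own writing; boundary terms omitted):

$$
-\frac{\mathrm{d}}{\mathrm{d}x}\!\left(a\!\left(\tfrac{x}{\epsilon}\right)\frac{\mathrm{d}u}{\mathrm{d}x}\right) = f(x) \ \ \text{on } (0,1),
\qquad
\mathcal{L}(\theta) = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{\mathrm{d}}{\mathrm{d}x}\!\left[a\!\left(\tfrac{x_i}{\epsilon}\right)\frac{\mathrm{d}u_\theta}{\mathrm{d}x}(x_i)\right] + f(x_i)\right)^{2}.
$$

The result above says the NTK matrix of this loss has Frobenius norm growing like $1/\epsilon^2$, so gradient-based training becomes increasingly stiff as the coefficient oscillates faster.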

URL: https://openreview.net/forum?id=QfyVqvpg7u

---

Title: Greedier is Better: Selecting Multiple Neighbors per Iteration for Sparse Subspace Clustering

Authors: Jwo-Yuh Wu, Liang-Chi Huang, Wen Hsuan Li, Chun-Hung Liu, Rung-Hung Gau

Abstract: Sparse subspace clustering (SSC) using greedy-based neighbor selection, such as orthogonal matching pursuit (OMP), has been known as a popular computationally-efficient alternative to the standard $\ell_1$-minimization based methods. However, existing stopping rules used by OMP to halt the neighbor search need additional offline work to estimate some ground truths, e.g., the subspace dimension and/or noise strength. This paper proposes a new SSC scheme using generalized OMP (GOMP), a souped-up variant of OMP whereby multiple, say $p(\geq1)$, neighbors are identified per iteration to further speed up neighbor acquisition, along with a new stopping rule requiring nothing more than knowledge of the ambient signal dimension and the number $p$ of neighbors identified in each iteration. Compared to conventional OMP (i.e., $p=1$), the proposed GOMP method involves fewer iterations, thereby enjoying lower algorithmic complexity. Under the semi-random model, analytic performance guarantees are provided. It is shown that, with high probability, (i) GOMP can retrieve more true neighbors than OMP, consequently yielding higher data clustering accuracy, and (ii) the proposed stopping rule terminates the neighbor search once the number of recovered neighbors is close to the subspace dimension. Issues in selecting $p$ for practical implementation are also discussed. Computer simulations using both synthetic and real data are provided to demonstrate the effectiveness of the proposed approach and validate our analytic study.
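
A minimal sketch of the GOMP selection step (the stopping rule below, an atom budget plus residual tolerance, is a generic placeholder and not the ambient-dimension-based rule proposed in the paper):

```python
import numpy as np

def gomp(y, D, p=3, max_atoms=None, tol=1e-6):
    """Generalized OMP: add the p most correlated atoms per iteration, refit by least squares."""
    n, _ = D.shape
    max_atoms = max_atoms or n
    support, x = [], np.zeros(0)
    residual = y.astype(float).copy()
    while len(support) < max_atoms and np.linalg.norm(residual) > tol:
        corr = np.abs(D.T @ residual)
        if support:
            corr[support] = -np.inf                  # do not re-select chosen atoms
        support += list(np.argsort(corr)[-p:])       # p new neighbors this iteration
        x, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ x
    return support, x
```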

URL: https://openreview.net/forum?id=djD8IbSvgm

---

Title: Distributed Newton-Type Methods with Communication Compression and Bernoulli Aggregation

Authors: Rustem Islamov, Xun Qian, Slavomir Hanzely, Mher Safaryan, Peter Richtárik

Abstract: Despite their high computation and communication costs, Newton-type methods remain an appealing option for distributed training due to their robustness against ill-conditioned convex problems. In this work, we study communication compression and aggregation mechanisms for curvature information in order to reduce these costs while preserving theoretically superior local convergence guarantees. We prove that the recently developed class of three point compressors (3PC) of [Richtarik et al., 2022] for gradient communication can be generalized to Hessian communication as well. This result opens up a wide variety of communication strategies, such as contractive compression and lazy aggregation, at our disposal to compress prohibitively costly curvature information. Moreover, we discover several new 3PC mechanisms, such as adaptive thresholding and Bernoulli aggregation, which require reduced communication and only occasional Hessian computations. Furthermore, we extend and analyze our approach to bidirectional communication compression and partial device participation setups to cater to the practical considerations of applications in federated learning. For all our methods, we derive fast condition-number-independent local linear and/or superlinear convergence rates. Finally, with extensive numerical evaluations on convex optimization problems, we illustrate that our designed schemes achieve state-of-the-art communication complexity compared to several key baselines using second-order information.
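
A schematic example of the kind of compressed Hessian update this family covers, using Top-K of the correction as the contractive compressor (the actual 3PC class and the paper's Bernoulli aggregation rule are more general):

```python
import numpy as np

def top_k(M, k):
    """Keep the k largest-magnitude entries of M; a simple contractive compressor."""
    flat = M.flatten()
    idx = np.argsort(np.abs(flat))[-k:]
    out = np.zeros_like(flat)
    out[idx] = flat[idx]
    return out.reshape(M.shape)

def compressed_hessian_update(H_prev, H_exact, k):
    """Communicate only a compressed correction to the previous Hessian estimate."""
    return H_prev + top_k(H_exact - H_prev, k)
```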

URL: https://openreview.net/forum?id=NekBTCKJ1H

---

Title: Fourier Features in Reinforcement Learning with Neural Networks

Authors: David Brellmann, David Filliat, Goran Frehse

Abstract: In classic Reinforcement Learning (RL), the performance of algorithms depends critically on data representation, i.e., the way the states of the system are represented as features. Choosing appropriate features for a task is an important way of adding prior domain knowledge, since cleverly distributing information into states facilitates appropriate generalization. For linear function approximation, the representation is usually hand-designed according to the task at hand and projected into a higher-dimensional space to facilitate linear separation. Among the feature encodings used in RL for linear function approximation, we can mention, non-exhaustively, Polynomial Features and Tile Coding. However, the main bottleneck of such feature encodings is that they do not scale to high-dimensional inputs, as they grow exponentially in size with the input dimension.
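
To make the exponential growth concrete, here is the standard Fourier-basis encoding used for linear value-function approximation (a textbook construction, not code from the paper):

```python
import itertools
import numpy as np

def fourier_features(state, order=3):
    """Fourier-basis features for a state normalized to [0, 1]^d: one cosine per
    integer coefficient vector c in {0, ..., order}^d. The feature count is
    (order + 1) ** d, i.e. exponential in the input dimension."""
    d = len(state)
    coeffs = np.array(list(itertools.product(range(order + 1), repeat=d)))
    return np.cos(np.pi * coeffs @ np.asarray(state, dtype=float))
```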

URL: https://openreview.net/forum?id=LWotmCKC6Y

---

Title: Diagnostic Tool for Out-of-Sample Model Evaluation

Authors: Ludvig Hult, Dave Zachariah, Peter Stoica

Abstract: Assessment of model fitness is a key part of machine learning. The standard paradigm of model evaluation is analysis of the average loss over future data. This is often explicit in model fitting, where we select models that minimize the average loss over training data as a surrogate, but comes with limited theoretical guarantees. In this paper, we consider the problem of characterizing a batch of out-of-sample losses of a model using a calibration data set. We provide finite-sample limits on the out-of-sample losses that are statistically valid under quite general conditions and propose a diagnostic tool that is simple to compute and interpret. Several numerical experiments are presented to show how the proposed method quantifies the impact of distribution shifts, aids the analysis of regression, and enables model selection as well as hyperparameter tuning.
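
A generic construction in the same spirit, assuming exchangeable calibration and test losses (this is a standard conformal-style quantile bound, not necessarily the paper's diagnostic tool):

```python
import numpy as np

def loss_quantile_bound(calibration_losses, alpha=0.1):
    """With probability at least 1 - alpha (under exchangeability), a new
    out-of-sample loss does not exceed the returned value."""
    n = len(calibration_losses)
    q = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(calibration_losses, q, method="higher")
```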

URL: https://openreview.net/forum?id=Ulf3QZG9DC

---

Title: Straggler-Resilient Personalized Federated Learning

Authors: Isidoros Tziotis, Zebang Shen, Ramtin Pedarsani, Hamed Hassani, Aryan Mokhtari

Abstract: Federated Learning is an emerging learning paradigm that allows training models from samples distributed across a large network of clients while respecting privacy and communication restrictions. Despite its success, federated learning faces several challenges related to its decentralized nature. In this work, we develop a novel algorithmic procedure with theoretical speedup guarantees that simultaneously handles two of these hurdles, namely (i) data heterogeneity, i.e., data distributions can vary substantially across clients, and (ii) system heterogeneity, i.e., the computational power of the clients could differ significantly. By leveraging previous works in the realm of representation learning (Collins et al., 2021; Liang et al., 2020), our method constructs a global common representation utilizing the data from all clients. Additionally, it learns a user-specific set of parameters resulting in a personalized solution for each individual client. Furthermore, it mitigates the effects of stragglers by adaptively selecting clients based on their computational characteristics, thus achieving for the first time near optimal sample complexity and provable logarithmic speedup. Experimental results support our theoretical findings showing the superiority of our method over alternative personalized federated schemes in system and data heterogeneous environments.

URL: https://openreview.net/forum?id=gxEpUFxIgz

---


New submissions
===============


Title: Inverse Kernel Decomposition

Abstract: The state-of-the-art dimensionality reduction approaches largely rely on complicated optimization procedures. On the other hand, closed-form approaches requiring merely eigen-decomposition do not have enough sophistication and nonlinearity. In this paper, we propose a novel nonlinear dimensionality reduction method---Inverse Kernel Decomposition (IKD)---based on an eigen-decomposition of the sample covariance matrix of data. The method is inspired by Gaussian process latent variable models (GPLVMs) and has comparable performance with GPLVMs. To deal with very noisy data with weak correlations, we propose two solutions---blockwise and geodesic---to make use of locally correlated data points and provide better and numerically more stable latent estimations. We use synthetic datasets and four real-world datasets to show that IKD is a better dimensionality reduction method than other eigen-decomposition-based methods, and achieves comparable performance against optimization-based methods with faster running speeds.

URL: https://openreview.net/forum?id=H4OE7toXpa

---

Title: Using Motion Cues to Supervise Single-frame Body Pose & Shape Estimation in Low Data Regimes

Abstract: When enough annotated training data is available, supervised deep-learning algorithms excel at estimating human body pose and shape using a single camera. When too little such data is available, this can be mitigated by using other information sources, such as databases of body shapes, to learn priors. Unfortunately, such sources are not always available either.
We show that, in such cases, easy-to-obtain unannotated videos can be used instead to provide the required supervisory signals. Given a model trained on too little annotated data, we compute poses in consecutive frames along with the optical flow between them. We then enforce consistency between the image optical flow and the one that can be inferred from the change in pose from one frame to the next. This provides enough additional supervision to effectively refine the network weights and to perform on par with methods trained using far more annotated data.
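
A minimal sketch of the consistency term described above, assuming projected 2D joints in normalized image coordinates and a dense flow field (shapes and the sampling scheme are our illustrative assumptions, not the paper's exact formulation):

```python
import torch
import torch.nn.functional as F

def flow_consistency_loss(joints_2d_t, joints_2d_t1, flow_t):
    # joints_2d_*: (B, J, 2) projected joints in [-1, 1] coords; flow_t: (B, 2, H, W)
    grid = joints_2d_t.unsqueeze(2)                             # (B, J, 1, 2)
    sampled = F.grid_sample(flow_t, grid, align_corners=True)   # flow at joint locations
    sampled = sampled.squeeze(-1).permute(0, 2, 1)              # (B, J, 2)
    predicted = joints_2d_t1 - joints_2d_t                      # displacement implied by pose change
    return F.l1_loss(predicted, sampled)
```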

URL: https://openreview.net/forum?id=fUhOb14sQv

---

Title: PixMIM: Rethinking Pixel Reconstruction in Masked Image Modeling

Abstract: Masked Image Modeling (MIM) has achieved promising progress with the advent of Masked Autoencoders (MAE) and BEiT. However, subsequent works have complicated the framework with new auxiliary tasks or extra pre-trained models, inevitably increasing computational overhead. This paper undertakes a fundamental analysis of MIM from the perspective of pixel reconstruction, which examines the input image patches and reconstruction target, and highlights two critical but previously overlooked bottlenecks.
Based on this analysis, we propose a remarkably simple and effective method, PixMIM, that entails two strategies: 1) filtering the high-frequency components from the reconstruction target to de-emphasize the network's focus on texture-rich details and 2) adopting a conservative data transform strategy to alleviate the problem of missing foreground in MIM training. PixMIM can be easily integrated into most existing pixel-based MIM approaches (i.e., those using raw images as the reconstruction target) with negligible additional computation. Without bells and whistles, our method consistently improves four MIM approaches, MAE, MFF, ConvMAE, and LSMAE, across various downstream tasks. We believe this effective plug-and-play method will serve as a strong baseline for self-supervised learning and provide insights for future improvements of the MIM framework. Code and models will be available.
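
A small sketch of the first strategy, removing high frequencies from the reconstruction target with an ideal low-pass filter in the Fourier domain (the cutoff and filter shape are our illustrative choices, not necessarily the paper's):

```python
import torch

def low_pass_target(img, cutoff=0.25):
    """Zero out Fourier coefficients outside a centered low-frequency box,
    so the reconstruction target keeps shapes but loses fine texture."""
    B, C, H, W = img.shape
    f = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    mask = torch.zeros(H, W, device=img.device)
    h, w = int(H * cutoff), int(W * cutoff)
    mask[H // 2 - h:H // 2 + h, W // 2 - w:W // 2 + w] = 1.0
    return torch.fft.ifft2(torch.fft.ifftshift(f * mask, dim=(-2, -1))).real
```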

URL: https://openreview.net/forum?id=qyfz0QrkqP

---

Title: A Pseudo-Metric between Probability Distributions based on Depth-Trimmed Regions

Abstract: The design of a metric between probability distributions is a longstanding problem motivated by numerous applications in machine learning. Focusing on continuous probability distributions in the Euclidean space $\mathbb{R}^d$, we introduce a novel pseudo-metric between probability distributions by leveraging the extension of univariate quantiles to multivariate spaces. Data depth is a nonparametric statistical tool that measures the centrality of any element $x\in\mathbb{R}^d$ with respect to (w.r.t.) a probability distribution or a dataset. It is a natural median-oriented extension of the cumulative distribution function (cdf) to the multivariate case. Thus, its upper-level sets---the depth-trimmed regions---give rise to a definition of multivariate quantiles. The new pseudo-metric relies on the average of the Hausdorff distance between the depth-based quantile regions for each distribution. Its good behavior with respect to major transformation groups, as well as its ability to factor out translations, is demonstrated. Robustness, an appealing feature of this pseudo-metric, is studied through the finite-sample breakdown point. Moreover, we propose an efficient approximation method with linear time complexity w.r.t. the size of the dataset and its dimension. The quality of this approximation and the performance of the proposed approach are illustrated in numerical experiments.
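
In our notation, the construction reads roughly as an average of Hausdorff distances between depth-trimmed regions over depth levels,

$$
d(P, Q) \;=\; \int_0^1 d_{\mathrm{H}}\big(R_{\alpha}(P),\, R_{\alpha}(Q)\big)\,\mathrm{d}\alpha
\;\approx\; \frac{1}{K}\sum_{k=1}^{K} d_{\mathrm{H}}\big(R_{\alpha_k}(P),\, R_{\alpha_k}(Q)\big),
$$

where $R_{\alpha}(\cdot)$ denotes the depth-trimmed region at level $\alpha$ and $d_{\mathrm{H}}$ the Hausdorff distance; the exact weighting and region construction are the paper's.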

URL: https://openreview.net/forum?id=QySD5r7PeE

---

Title: Variational autoencoder with weighted samples for high-dimensional non-parametric adaptive importance sampling

Abstract: Probability density function estimation with weighted samples is the main foundation of all adaptive importance sampling algorithms. Classically, a target distribution is approximated either by a non-parametric model or within a parametric family. However, these models suffer from the curse of dimensionality or from their lack of flexibility. In this contribution, we suggest using, as the approximating model, a distribution parameterised by a variational autoencoder. We extend the existing framework to the case of weighted samples by introducing a new objective function. The flexibility of the obtained family of distributions makes it as expressive as a non-parametric model, and despite the very high number of parameters to estimate, this family is much more efficient in high dimension than the classical Gaussian or Gaussian mixture families. Moreover, in order to add flexibility to the model and to be able to learn multimodal distributions, we consider a learnable prior distribution for the variational autoencoder latent variables. We also introduce a new pre-training procedure for the variational autoencoder to find good starting weights for the neural networks and to prevent the posterior collapse phenomenon as much as possible. Finally, we make explicit how the resulting distribution can be combined with importance sampling, and we exploit the proposed procedure in existing adaptive importance sampling algorithms to draw points from a target distribution and to estimate a rare-event probability in high dimension on two multimodal problems.

URL: https://openreview.net/forum?id=nzG9KGssSe

---

Title: Do Smaller Language Models Answer Contextualised Questions Through Memorisation Or Generalisation?

Abstract: A distinction is often drawn between a model's ability to predict a label for an evaluation sample that is directly memorised from highly similar training samples versus an ability to predict the label via some method of generalisation. In the context of using Language Models for question-answering, discussion continues to occur as to the extent to which questions are answered through memorisation. We consider this issue for questions that would ideally be answered through reasoning over an associated context. We propose a method of identifying evaluation samples for which it is very unlikely our model would have memorised the answers. Our method is based on semantic similarity of input tokens and label tokens between training and evaluation samples. We show that our method offers an advantage over prior approaches in that it is able to surface evaluation-train pairs that have overlap in either contiguous or discontiguous sequences of tokens. We use this method to identify unmemorisable subsets of our evaluation datasets. We train two Language Models in a multitask fashion whereby the second model differs from the first only in that it has two additional datasets added to the training regime that are designed to impart simple numerical reasoning strategies of a sort known to improve performance on some of our evaluation datasets but not on others. We then show that there is performance improvement between the two models on the unmemorisable subsets of the evaluation datasets that were expected to benefit from the additional training datasets. Specifically, performance on unmemorisable subsets of two of our evaluation datasets, DROP and ROPES, significantly improves by 9.0% and 25.7%, respectively, while the other evaluation datasets show no significant change in performance.

URL: https://openreview.net/forum?id=5IzdKyXVa0

---

Title: A Unified View of Differentially Private Deep Generative Modeling

Abstract: The availability of rich and vast data sources has greatly advanced machine learning applications in various domains. However, data with privacy concerns comes with stringent regulations that frequently prohibit data access and data sharing. Overcoming these obstacles in compliance with privacy considerations is key for technological progress in many real-world application scenarios that involve sensitive data. Differentially private (DP) data publishing provides a compelling solution, where only a sanitized form of the data is publicly released, enabling privacy-preserving downstream analysis and reproducible research in sensitive domains. In recent years, various approaches have been proposed for achieving privacy-preserving high-dimensional data generation by private training on top of deep neural networks. In this paper, we present a novel unified view that systematizes these approaches. Our view provides a joint design space for systematically deriving methods that cater to different use cases. We then discuss the strengths, limitations, and inherent correlations between different approaches, aiming to shed light on crucial aspects and inspire future research. We conclude by presenting potential paths forward for the field of DP data generation, with the aim of steering the community toward making the next important steps in advancing privacy-preserving learning.

URL: https://openreview.net/forum?id=YgmBD2c9qX

---

Title: A Survey on Out-of-Distribution Detection in NLP

Abstract: Out-of-distribution (OOD) detection is essential for the reliable and safe deployment of machine learning systems in the real world. Great progress has been made over the past years. This paper presents the first review of recent advances in OOD detection with a particular focus on natural language processing approaches. First, we provide a formal definition of OOD detection and discuss several related fields. We then categorize recent algorithms into three classes according to the data they used: (1) OOD data available, (2) OOD data unavailable + in-distribution (ID) label available, and (3) OOD data unavailable + ID label unavailable. Third, we introduce datasets, applications, and metrics. Finally, we summarize existing work and present potential future research topics.

URL: https://openreview.net/forum?id=nYjSkOy8ij

---

Title: Fast Training of Diffusion Models with Masked Transformers

Abstract: We propose an efficient approach to train large diffusion models with masked transformers. While masked transformers have been extensively explored for representation learning, their application to generative learning is less explored in the vision domain. Our work is the first to exploit masked training to reduce the training cost of diffusion models significantly. Specifically, we randomly mask out a high proportion (e.g., 50%) of patches in diffused input images during training. For masked training, we introduce an asymmetric encoder-decoder architecture consisting of a transformer encoder that operates only on unmasked patches and a lightweight transformer decoder on full patches. To promote a long-range understanding of full patches, we add an auxiliary task of reconstructing masked patches to the denoising score matching objective that learns the score of unmasked patches. Experiments on ImageNet-256$\times$256 show that our approach achieves similar performance (FID 2.28 with guidance) to the state-of-the-art Diffusion Transformer (DiT) model, using only 31% of its original training time. Thus, our method shows a promising way of efficiently training large transformer-based diffusion models without sacrificing the generative performance.
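
A minimal sketch of the masking step that feeds the asymmetric encoder (the patch size and mask ratio are the abstract's example values; the tensor layout is our assumption):

```python
import torch

def random_patch_mask(x, patch=16, mask_ratio=0.5):
    """Split images into non-overlapping patches and keep a random subset;
    returns the kept patches and their indices."""
    B, C, H, W = x.shape
    patches = x.unfold(2, patch, patch).unfold(3, patch, patch)        # (B, C, H/p, W/p, p, p)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch * patch)
    n = patches.shape[1]
    keep = int(n * (1 - mask_ratio))
    idx = torch.rand(B, n).argsort(dim=1)[:, :keep]                    # random subset per image
    kept = torch.gather(patches, 1, idx.unsqueeze(-1).expand(-1, -1, patches.shape[-1]))
    return kept, idx
```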

URL: https://openreview.net/forum?id=vTBjBtGioE

---

Title: Pull-back Geometry of Persistent Homology Encodings

Abstract: Persistent homology (PH) is a method for creating topology-inspired representations of data. In contrast to data-driven representation methods, it is an expert-based method with explicit computations designed by hand. Nonetheless, owing to the complexity of these computations, its properties and interpretation have remained elusive. We investigate the spectrum of the Jacobian of PH data encodings and the pull-back geometry that they induce on the data manifold. Then, by measuring different perturbations and features on the data manifold with respect to this geometry, we can identify which of them are recognized or ignored by the PH encodings. This also allows us to compare different encodings in terms of their induced geometry. Importantly, the approach does not require training and testing on a particular task and permits a direct exploration of PH. We experimentally demonstrate that the pull-back norm can be used as a predictor of performance on downstream tasks and to select suitable PH encodings accordingly.

URL: https://openreview.net/forum?id=7yswRA8zzw

---

Title: Molecular De Novo Design through Transformer-based Reinforcement Learning

Abstract: In this work, we introduce a method to fine-tune a Transformer-based generative model for molecular de novo design. Leveraging the superior sequence learning capacity of Transformers over Recurrent Neural Networks (RNNs), our model can generate molecular structures with desired properties effectively. In contrast to the traditional RNN-based models, our proposed method exhibits superior performance in generating compounds predicted to be active against various biological targets, capturing long-term dependencies in the molecular structure sequence. The model's efficacy is demonstrated across numerous tasks, including generating analogues to a query structure and producing compounds with particular attributes, outperforming the baseline RNN-based methods. Our approach can be used for scaffold hopping, library expansion starting from a single molecule, and generating compounds with high predicted activity against biological targets.

URL: https://openreview.net/forum?id=3s8v6A14VN

---

Title: WaveBench: Benchmarks Datasets for Modeling Wave Propagation PDEs

Abstract: Wave-based imaging techniques play a critical role in diverse scientific, medical, and industrial endeavors, from discovering hidden structures beneath the Earth's surface to ultrasound diagnostics. They rely on accurate solutions to the forward and inverse problems for partial differential equations (PDEs) that govern wave propagation. Surrogate PDE solvers based on machine learning emerged as an effective approach to computing the solutions more efficiently than via classical numerical schemes. However, existing datasets for PDE surrogates offer only limited coverage of the wave propagation phenomenon. In this paper, we present WaveBench, a comprehensive collection of benchmark datasets for wave propagation PDEs. WaveBench (1) contains 24 datasets that cover a wide range of forward and inverse problems for time-harmonic and time-varying wave phenomena; (2) includes a user-friendly PyTorch environment for comparing learning-based methods; and (3) comprises reference performance and model checkpoints of popular PDE surrogates such as Fourier neural operators and U-Nets. Our evaluation on WaveBench demonstrates the impressive performance of PDE surrogates on in-distribution samples, while simultaneously unveiling their limitations on out-of-distribution samples, indicating room for future improvements. We anticipate that WaveBench will stimulate the development of accurate wave-based imaging techniques through machine learning.

URL: https://openreview.net/forum?id=6wpInwnzs8

---

Title: Prismer: A Vision-Language Model with Multi-Task Experts

Abstract: Recent vision-language models have shown impressive multi-modal generation capabilities. However, typically they require training huge models on massive datasets. As a more scalable alternative, we introduce Prismer, a data- and parameter-efficient vision-language model that leverages an ensemble of task-specific experts. Prismer only requires training of a small number of components, with the majority of network weights inherited from multiple readily-available, pre-trained experts, and kept frozen during training. By leveraging experts from a wide range of domains, we show Prismer can efficiently pool this expert knowledge and adapt it to various vision-language reasoning tasks. In our experiments, we show that Prismer achieves fine-tuned and few-shot learning performance competitive with the current state of the art, whilst requiring up to two orders of magnitude less training data.

URL: https://openreview.net/forum?id=R7H43YD6Lo

---

Title: Budgeted Online Model Selection and Fine-Tuning via Federated Learning

Abstract: Online model selection involves selecting a model from a set of candidate models `on the fly' to perform prediction on a stream of data. The choice of candidate models therefore has a crucial impact on performance. Although employing a larger set of candidate models naturally leads to more flexibility in model selection, this may be infeasible in cases where prediction tasks are performed on edge devices with limited memory. Faced with this challenge, the present paper proposes an online federated model selection framework where a group of learners (clients) interacts with a server with sufficient memory such that the server stores all candidate models. However, each client only chooses to store a subset of models that can be fit into its memory and performs its own prediction task using one of the stored models. Furthermore, employing the proposed algorithm, clients and the server collaborate to fine-tune models to adapt them to a non-stationary environment. Theoretical analysis proves that the proposed algorithm enjoys sub-linear regret with respect to the best model in hindsight. Experiments on real datasets demonstrate the effectiveness of the proposed algorithm.

URL: https://openreview.net/forum?id=WeiRR8h87X

---

Title: Evaluation of Causal Inference Models to Access Heterogeneous Treatment Effect

Abstract: Causal inference has gained popularity over the last years due to its ability to see through correlation and find causal relationships between covariates. A number of methods have been created to this end, but there is no systematic benchmark of these methods, including the benefits and drawbacks of using each one of them. This research compares a number of those methods on how well they assess the heterogeneous treatment effect using a variety of synthetically created data sets, divided between low-dimensional and high-dimensional covariates and increasing complexity between the covariates and the target. We compare the error between those methods and discuss in which settings and under which premises each method is better suited.

URL: https://openreview.net/forum?id=0IqqmFWKIK

---

Title: Separability Analysis for Causal Discovery in Mixture of DAGs

Abstract: Directed acyclic graphs (DAGs) are effective for compactly representing causal systems and specifying the causal relationships among the system's constituents. However, in some systems, a single DAG is insufficient to specify all causal relationships, and a mixture of multiple DAGs is needed. Some examples include time-varying causal systems or aggregated subgroups of a population. This paper formalizes a framework for recovering the causal graph representing a mixture of DAGs. A major difference between single- versus mixture-DAG recovery is the existence of node pairs that are separable in the individual DAGs but become inseparable in their mixture. This paper provides the theoretical foundations for analyzing such inseparable node pairs. Specifically, the notion of \emph{emergent edges} is introduced to represent such inseparable pairs that do not exist in the single DAGs but emerge in their mixtures. Necessary conditions for identifying the emergent edges are established. Operationally, these conditions serve as sufficient conditions for separating a pair of nodes in the mixture of DAGs. These results are further extended, and matching necessary and sufficient conditions for identifying the emergent edges in tree-structured DAGs are established. Finally, a novel graphical representation is formalized to specify these conditions, and an algorithm is provided for inferring the learnable causal relations.

URL: https://openreview.net/forum?id=ALRWXT1RLZ

---

Title: Using Sum-Product Networks to Assess Uncertainty in Deep Active Learning

Abstract: The success of deep active learning hinges on the choice of an effective acquisition function, which ranks not yet labeled data points according to their expected informativeness. Many acquisition functions are (partly) based on the uncertainty that the current model has about the class label of a point, yet there is no generally agreed upon strategy for computing such uncertainty.
This paper proposes a new and very simple approach to computing uncertainty in deep active learning with a Convolutional Neural Network (CNN). The main idea is to use the feature representation extracted by the CNN as data for training a Sum-Product Network (SPN). Since SPNs are typically used for estimating the distribution of a dataset, they are well suited to the task of estimating class probabilities that can be used directly by standard acquisition functions such as max entropy and variational ratio.
The effectiveness of our method is demonstrated in an experimental study on several standard benchmark datasets for image classification, where we compare it to various state-of-the-art methods for assessing uncertainty in deep active learning.
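
For reference, the max-entropy acquisition step that would consume the SPN's class probabilities is simply (a standard acquisition function, not code from the paper):

```python
import numpy as np

def max_entropy_acquisition(class_probs, batch_size=10):
    """Rank unlabeled points by predictive entropy; return the most uncertain ones."""
    p = np.clip(class_probs, 1e-12, 1.0)
    entropy = -(p * np.log(p)).sum(axis=1)
    return np.argsort(-entropy)[:batch_size]
```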

URL: https://openreview.net/forum?id=Ai9XpjGxjl

---

Title: Regret-Optimal Federated Transfer Learning for Kernel Regression – with Applications in American Option Pricing

Abstract: We propose an optimal iterative scheme for federated transfer learning, where a central planner has access to datasets ${\cal D}_1,\dots,{\cal D}_N$ for the same learning model $f_{\theta}$. Our objective is to minimize the cumulative deviation of the generated parameters $\{\theta_i(t)\}_{t=0}^T$ across all $T$ iterations from the specialized parameters $\theta^\star_{1},\ldots,\theta^\star_N$ obtained for each dataset while respecting the loss function for the model $f_{\theta(T)}$ produced by the algorithm upon halting. We only allow for continual communication between each of the specialized models (nodes/agents) and the central planner (server) at each iteration (round). For the case where the model $f_{\theta}$ is a finite-rank kernel regression, we derive explicit updates for the regret-optimal algorithm. By leveraging symmetries within the regret-optimal algorithm, we further develop a nearly regret-optimal heuristic that runs with $\mathcal{O}(Np^2)$ fewer elementary operations, where $p$ is the dimension of the parameter space. Additionally, we investigate the adversarial robustness of the regret-optimal algorithm, showing that an adversary which perturbs $q$ training pairs by at most $\varepsilon>0$, across all training sets, cannot reduce the regret-optimal algorithm's regret by more than $\mathcal{O}(\varepsilon q \bar{N}^{1/2})$, where $\bar{N}$ is the aggregate number of training pairs. To validate our theoretical findings, we conduct numerical experiments in the context of American option pricing, utilizing a randomly generated finite-rank kernel.

URL: https://openreview.net/forum?id=lDYnlSXNx1

---

Title: Causal Testing of Representation Similarity Metrics

Abstract: Representation similarity metrics are widely used to compare learned representations in neural networks, as is evident in extensive literature investigating metrics that accurately capture information encoded in representations. However, aiming to capture all of the information available in representations may have little to do with what information is actually used by the downstream network. One solution is to experiment with causal measures of function. By ablating groups of units thought to carry information and observing whether those ablations affect network performance, we can focus on an outcome that causally links representations to function. In this paper, we systematically test representation similarity metrics to evaluate their sensitivity to causal functional changes induced by ablation. We use network performance changes after ablation as a way to causally measure the influence of representation on function. These measures of function allow us to test how well similarity metrics capture changes in network performance versus changes to linear decodability. Network performance measures index the information used by the downstream network, while linear decoding methods index available information in the representation. We show that all of the tested metrics are more sensitive to decodable features than to network performance. When comparing these metrics, Procrustes and CKA outperform regularized CCA-based methods on average, although their advantage diminishes when network performance is the outcome of interest. We provide causal tests of the utility of different representational similarity metrics. Our results suggest that interpretability methods will be more effective if they are based on representational similarity metrics that have been evaluated using causal tests.

URL: https://openreview.net/forum?id=YY2iA0hfia

---

Title: Bias/Variance is not the same as Approximation/Estimation

Abstract: We study the relation between two classical results: the bias-variance decomposition, and the approximation-estimation decomposition. Both are important conceptual tools in Machine Learning, helping us describe the nature of model fitting. It is commonly stated that the two decompositions "are closely related", or are "similar in spirit". However, sometimes it is said they "are equivalent" (spoiler: no, they're not). In reality, they have subtle connections, cutting across learning theory and classical statistics, that (very surprisingly) have not been previously observed. In this short paper we uncover these connections, and build a bridge between the two.
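
For readers who want the two decompositions side by side (standard statements; squared loss with a noiseless target in the first, for brevity):

$$
\mathbb{E}_{D}\big[(f_D(x) - y)^2\big]
= \underbrace{\big(\mathbb{E}_D[f_D(x)] - y\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}_D\big[(f_D(x) - \mathbb{E}_D[f_D(x)])^2\big]}_{\text{variance}},
\qquad
R(f_D) - R^{*}
= \underbrace{R(f_{\mathcal{H}}^{*}) - R^{*}}_{\text{approximation}}
+ \underbrace{R(f_D) - R(f_{\mathcal{H}}^{*})}_{\text{estimation}} .
$$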

URL: https://openreview.net/forum?id=4TnFbv16hK

---

Title: Mitigating Off-Policy Bias in Actor-Critic Methods with One-Step Q-learning: A Novel Correction Approach

Abstract: Compared to on-policy counterparts, off-policy model-free deep reinforcement learning can improve data efficiency by repeatedly using the previously gathered data. However, off-policy learning becomes challenging when the discrepancy between the underlying distributions of the agent’s policy and collected data increases. Although the well-studied importance sampling and off-policy policy gradient techniques were proposed to compensate for this discrepancy, they usually require a collection of long trajectories and induce additional problems such as vanishing/exploding gradients or discarding many useful experiences, which eventually increases the computational complexity. Moreover, their generalization to either continuous action domains or policies approximated by deterministic deep neural networks is strictly limited. To overcome these limitations, we introduce a novel policy similarity measure to mitigate the effects of such discrepancy in continuous control. Our method offers an adequate single-step off-policy correction that is applicable to deterministic policy networks. Theoretical and empirical studies demonstrate that it can achieve a “safe” off-policy learning and substantially improve the state-of-the-art by attaining higher returns in fewer steps than the competing methods through an effective schedule of the learning rate in Q-learning and policy optimization.

URL: https://openreview.net/forum?id=CyjG4ZKCtE

---

Title: Introducing "Forecast Utterance" for Conversational Data Science

Abstract: Envision an intelligent agent capable of assisting users in conducting forecasting tasks through intuitive, natural conversations, without requiring in-depth knowledge of the underlying machine learning (ML) processes. A significant challenge for the agent in this endeavor is to accurately comprehend the user's prediction goals and, consequently, formulate precise ML tasks. In this paper, we take a pioneering step towards this ambitious goal by introducing a new concept called Forecast Utterance and then focus on the automatic and accurate interpretation of users' prediction goals from these utterances. Specifically, we frame the task as a slot-filling problem, where each slot corresponds to a specific aspect of the goal prediction task. We then employ two zero-shot methods for solving the slot-filling task, namely: 1) Entity Extraction (EE), and 2) Question-Answering (QA) techniques. Our experiments, conducted with three meticulously crafted data sets, validate the viability of our ambitious goal and demonstrate the effectiveness of both EE and QA techniques in interpreting Forecast Utterances.

URL: https://openreview.net/forum?id=H2EeStRTQn

---
