Weekly TMLR digest for Jun 11, 2023


TMLR

Jun 10, 2023, 8:00:11 PM
to tmlr-annou...@googlegroups.com


New certifications
==================

Survey Certification: How to Reuse and Compose Knowledge for a Lifetime of Tasks: A Survey on Continual Learning and Functional Composition

Jorge A Mendez, ERIC EATON

https://openreview.net/forum?id=VynY6Bk03b

---


Accepted papers
===============


Title: Attentional-Biased Stochastic Gradient Descent

Authors: Qi Qi, Yi Xu, Wotao Yin, Rong Jin, Tianbao Yang

Abstract: In this paper, we present a simple yet effective provable method (named ABSGD) for addressing the data imbalance or label noise problem in deep learning. Our method is a simple modification to momentum SGD where we assign an individual importance weight to each sample in the mini-batch. The individual-level weight of a sampled data point is proportional to the exponential of a scaled loss value of the data, where the scaling factor is interpreted as the regularization parameter in the framework of distributionally robust optimization (DRO). Depending on whether the scaling factor is positive or negative, ABSGD is guaranteed to converge to a stationary point of an information-regularized min-max or min-min DRO problem, respectively. Compared with existing class-level weighting schemes, our method can capture the diversity between individual examples within each class. Compared with existing individual-level weighting methods that use meta-learning and require three backward propagations for computing mini-batch stochastic gradients, our method is more efficient, with only one backward propagation at each iteration as in standard deep learning methods. ABSGD is flexible enough to combine with other robust losses without any additional cost. Our empirical studies on several benchmark datasets demonstrate the effectiveness of the proposed method.
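
The weighting rule above is concrete enough to sketch. A minimal, hypothetical PyTorch rendering of one weighted step (not the authors' implementation; model, optimizer and the scaling factor lam are placeholders):

    # Hedged sketch: per-sample weights proportional to exp(loss / lam), i.e. a
    # softmax over the scaled mini-batch losses. lam > 0 gives the min-max (DRO)
    # behaviour, lam < 0 the min-min one. Single backward pass, as in standard SGD.
    import torch
    import torch.nn.functional as F

    def absgd_style_step(model, optimizer, x, y, lam=1.0):
        logits = model(x)
        per_sample_loss = F.cross_entropy(logits, y, reduction="none")   # shape [B]
        weights = torch.softmax(per_sample_loss.detach() / lam, dim=0)   # ∝ exp(loss/lam)
        loss = (weights * per_sample_loss).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

Using torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9) as the optimizer matches the "momentum SGD plus per-sample weights" description in the abstract.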

URL: https://openreview.net/forum?id=B0WYWvVA2r

---

Title: Reinforcement Teaching

Authors: Calarina Muslimani, Alex Lewandowski, Dale Schuurmans, Matthew E. Taylor, Jun Luo

Abstract: Machine learning algorithms learn to solve a task, but are unable to improve their ability to learn.
Meta-learning methods learn about machine learning algorithms and improve them so that they learn more quickly. However, existing meta-learning methods are either hand-crafted to improve one specific component of an algorithm or only work with differentiable algorithms.
We develop a unifying meta-learning framework, called \textit{Reinforcement Teaching}, to improve the learning process of \emph{any} algorithm. Under Reinforcement Teaching, a teaching policy is learned, through reinforcement, to improve a student's learning algorithm. To learn an effective teaching policy, we introduce the \textit{parametric-behavior embedder} that learns a representation of the student's learnable parameters from its input/output behavior. We further use \textit{learning progress} to shape the teacher's reward, allowing it to more quickly maximize the student's performance. To demonstrate the generality of Reinforcement Teaching, we conduct experiments in which a teacher learns to significantly improve both reinforcement and supervised learning algorithms. Reinforcement Teaching outperforms previous work using heuristic reward functions and state representations, as well as other parameter representations.

URL: https://openreview.net/forum?id=G2GKiicaJI

---

Title: Test-Time Adaptation for Visual Document Understanding

Authors: Sayna Ebrahimi, Sercan O Arik, Tomas Pfister

Abstract: For visual document understanding (VDU), self-supervised pretraining has been shown to successfully generate transferable representations, yet effective adaptation of such representations to distribution shifts at test time remains an unexplored area. We propose DocTTA, a novel test-time adaptation method for documents that performs source-free domain adaptation using unlabeled target document data. DocTTA leverages cross-modality self-supervised learning via masked visual language modeling, as well as pseudo-labeling, to adapt models learned on a \textit{source} domain to an unlabeled \textit{target} domain at test time. We introduce new benchmarks using existing public datasets for various VDU tasks, including entity recognition, key-value extraction, and document visual question answering. On these benchmarks, DocTTA shows significant improvements over source-model performance: up to 1.89\% (F1 score), 3.43\% (F1 score), and 17.68\% (ANLS score), respectively.

URL: https://openreview.net/forum?id=zshemTAa6U

---

Title: Learning to Incentivize Improvements from Strategic Agents

Authors: Yatong Chen, Jialu Wang, Yang Liu

Abstract: Machine learning systems are often used in settings where individuals adapt their features to obtain a desired outcome.
In such settings, strategic behavior leads to a sharp loss in model performance in deployment. In this work, we aim to address this problem by learning classifiers that encourage decision subjects to change their features in a way that leads to improvement in both the predicted and the true outcome. We frame the dynamics of prediction and adaptation as a two-stage game, and characterize optimal strategies for the model designer and its decision subjects. In benchmarks on simulated and real-world datasets, we find that classifiers trained using our method maintain the accuracy of existing approaches while inducing higher levels of improvement and less manipulation.

URL: https://openreview.net/forum?id=W98AEKQ38Y

---

Title: Finding Competence Regions in Domain Generalization

Authors: Jens Müller, Stefan T. Radev, Robert Schmier, Felix Draxler, Carsten Rother, Ullrich Koethe

Abstract: We investigate a "learning to reject" framework to address the problem of silent failures in Domain Generalization (DG), where the test distribution differs from the training distribution. Assuming a mild distribution shift, we wish to accept out-of-distribution (OOD) data from a new domain whenever a model's estimated competence foresees trustworthy responses, instead of rejecting OOD data outright. Trustworthiness is then predicted via a proxy incompetence score that is tightly linked to the performance of a classifier. We present a comprehensive experimental evaluation of existing proxy scores as incompetence scores for classification and highlight the resulting trade-offs between rejection rate and accuracy gain. For comparability with prior work, we focus on standard DG benchmarks and consider the effect of measuring incompetence via different learned representations in a closed versus an open world setting. Our results suggest that increasing incompetence scores are indeed predictive of reduced accuracy, leading to significant improvements of the average accuracy below a suitable incompetence threshold. However, the scores are not yet good enough to allow for a favorable accuracy/rejection trade-off in all tested domains. Surprisingly, our results also indicate that classifiers optimized for DG robustness do not outperform a naive Empirical Risk Minimization (ERM) baseline in the competence region, that is, where test samples elicit low incompetence scores.

URL: https://openreview.net/forum?id=TSy0vuwQFN

---

Title: Noise-robust Graph Learning by Estimating and Leveraging Pairwise Interactions

Authors: Xuefeng Du, Tian Bian, Yu Rong, Bo Han, Tongliang Liu, Tingyang Xu, Wenbing Huang, Yixuan Li, Junzhou Huang

Abstract: Teaching Graph Neural Networks (GNNs) to accurately classify nodes under severely noisy labels is an important problem in real-world graph learning applications, but is currently underexplored. Although pairwise training methods have demonstrated promise in supervised metric learning and unsupervised contrastive learning, they remain less studied on noisy graphs, where the structural pairwise interactions (PI) between nodes are abundant and thus might benefit label noise learning rather than the pointwise methods. This paper bridges the gap by proposing a pairwise framework for noisy node classification on graphs, which relies on the PI as a primary learning proxy in addition to the pointwise learning from the noisy node class labels. Our proposed framework PI-GNN contributes two novel components: (1) a confidence-aware PI estimation model that adaptively estimates the PI labels, which are defined as whether the two nodes share the same node labels, and (2) a decoupled training approach that leverages the estimated PI labels to regularize a node classification model for robust node classification. Extensive experiments on different datasets and GNN architectures demonstrate the effectiveness of PI-GNN, yielding a promising improvement over the state-of-the-art methods. Code is publicly available at https://github.com/TianBian95/pi-gnn.

URL: https://openreview.net/forum?id=r7imkFEAQb

---

Title: 3D-Aware Video Generation

Authors: Sherwin Bahmani, Jeong Joon Park, Despoina Paschalidou, Hao Tang, Gordon Wetzstein, Leonidas Guibas, Luc Van Gool, Radu Timofte

Abstract: Generative models have emerged as an essential building block for many image synthesis and editing tasks. Recent advances in this field have also enabled high-quality 3D or video content to be generated that exhibits either multi-view or temporal consistency. With our work, we explore 4D generative adversarial networks (GANs) that learn unconditional generation of 3D-aware videos. By combining neural implicit representations with a time-aware discriminator, we develop a GAN framework that synthesizes 3D video supervised only with monocular videos. We show that our method learns a rich embedding of decomposable 3D structures and motions that enables new visual effects of spatio-temporal renderings while producing imagery with quality comparable to that of existing 3D or video GANs.

URL: https://openreview.net/forum?id=SwlfyDq6B3

---

Title: Bounded Space Differentially Private Quantiles

Authors: Daniel Alabi, Omri Ben-Eliezer, Anamay Chaturvedi

Abstract: Estimating the quantiles of a large dataset is a fundamental problem in both the streaming algorithms literature and the differential privacy literature. However, all existing private mechanisms for distribution-independent quantile computation require space at least linear in the input size $n$. In this work, we devise a differentially private algorithm for the quantile estimation problem, with strongly sublinear space complexity, in the one-shot and continual observation settings. Our basic mechanism estimates any $\alpha$-approximate quantile of a length-$n$ stream over a data universe $\mathcal{X}$ with probability $1-\beta$ using $O\left( \frac{\log (|\mathcal{X}|/\beta) \log (\alpha \epsilon n)}{\alpha \epsilon} \right)$ space while satisfying $\epsilon$-differential privacy at a single time point. Our approach builds upon deterministic streaming algorithms for non-private quantile estimation instantiating the exponential mechanism using a utility function defined on sketch items, while (privately) sampling from intervals defined by the sketch. We also present another algorithm based on histograms that is especially well-suited to the multiple quantiles case. We implement our algorithms and experimentally evaluate them on synthetic and real-world datasets.
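
For intuition, the classical linear-space exponential-mechanism quantile can be sketched as below; the paper's contribution is running this kind of mechanism over a sublinear-space sketch rather than the full data. Bounds lo/hi on the data universe are assumed, and the sketch covers a single quantile at a single time point only.

    # Standard exponential mechanism for one quantile: utility is the (negative)
    # rank error, sensitivity 1, so sampling with weight exp(eps * u / 2) is eps-DP.
    import numpy as np

    def dp_quantile(data, q, eps, lo, hi, rng=None):
        if rng is None:
            rng = np.random.default_rng()
        z = np.concatenate(([lo], np.sort(np.clip(data, lo, hi)), [hi]))
        n = len(data)
        idx = np.arange(n + 1)                      # gap i lies between z[i] and z[i+1]
        utility = -np.abs(idx - q * n)              # rank-error utility
        log_w = np.log(np.maximum(z[1:] - z[:-1], 1e-12)) + eps * utility / 2.0
        p = np.exp(log_w - log_w.max())
        p /= p.sum()
        i = rng.choice(n + 1, p=p)
        return rng.uniform(z[i], z[i + 1])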

URL: https://openreview.net/forum?id=sixOD8YVvM

---

Title: The Stack: 3 TB of permissively licensed source code

Authors: Denis Kocetkov, Raymond Li, Loubna Ben allal, Jia LI, Chenghao Mou, Yacine Jernite, Margaret Mitchell, Carlos Muñoz Ferrandis, Sean Hughes, Thomas Wolf, Dzmitry Bahdanau, Leandro Von Werra, Harm de Vries

Abstract: Large Language Models (LLMs) play an ever-increasing role in the field of Artificial Intelligence (AI)--not only for natural language processing but also for code understanding and generation. To stimulate open and responsible research on LLMs for code, we introduce The Stack, a 3.1 TB dataset consisting of permissively licensed source code in 30 programming languages. We describe how we collect the full dataset, construct a permissively licensed subset, present a data governance plan, discuss limitations, and show promising results on text2code benchmarks by training 350M-parameter decoders on different Python subsets. We find that (1) near-deduplicating the data significantly boosts performance across all experiments, and (2) it is possible to match previously reported HumanEval and MBPP performance using only permissively licensed data. We make the dataset available at https://hf.co/BigCode, provide a tool called "Am I in The Stack" for developers to search The Stack for copies of their code (https://hf.co/spaces/bigcode/in-the-stack), and provide a process for code to be removed from the dataset.

URL: https://openreview.net/forum?id=pxpbTdUEpD

---

Title: Exploring the Approximation Capabilities of Multiplicative Neural Networks for Smooth Functions

Authors: Ido Ben-Shaul, Tomer Galanti, Shai Dekel

Abstract: Multiplication layers are a key component in various influential neural network modules, including self-attention and hypernetwork layers. In this paper, we investigate the approximation capabilities of deep neural networks with intermediate neurons connected by simple multiplication operations. We consider two classes of target functions: generalized bandlimited functions, which are frequently used to model real-world signals with finite bandwidth, and Sobolev-Type balls, which are embedded in the Sobolev Space $\mathcal{W}^{r,2}$. Our results demonstrate that multiplicative neural networks can approximate these functions with significantly fewer layers and neurons compared to standard ReLU neural networks, with respect to both input dimension and approximation error. These findings suggest that multiplicative gates can outperform standard feed-forward layers and have potential for improving neural network design.
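
A minimal sketch of the kind of multiplication layer referred to above (two linear maps combined elementwise, as in gating or hypernetwork-style modules); purely illustrative, not the exact construction analysed in the paper:

    import torch
    import torch.nn as nn

    class MultiplicativeLayer(nn.Module):
        def __init__(self, d_in, d_out):
            super().__init__()
            self.a = nn.Linear(d_in, d_out)
            self.b = nn.Linear(d_in, d_out)

        def forward(self, x):
            # intermediate neurons connected by a simple multiplication
            return self.a(x) * self.b(x)

    net = nn.Sequential(MultiplicativeLayer(16, 64), MultiplicativeLayer(64, 1))
    y = net(torch.randn(8, 16))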

URL: https://openreview.net/forum?id=sWQJfb2GSk

---

Title: Assuming Locally Equal Calibration Errors for Non-Parametric Multiclass Calibration

Authors: Kaspar Valk, Meelis Kull

Abstract: A probabilistic classifier is considered calibrated if it outputs probabilities equal to the expected class distribution given the classifier's output. Calibration is essential in safety-critical tasks where small deviations between the predicted probabilities and the actually observed class proportions can incur high costs. A common approach to improve the calibration of a classifier is to use a hold-out data set and a post-hoc calibration method to learn a correcting transformation for the classifier's output. This work explores the field of post-hoc calibration methods for multi-class classifiers and formulates two assumptions about the probability simplex which have been used by many existing non-parametric calibration methods, but despite this, have never been explicitly stated: assuming locally equal label distributions or assuming locally equal calibration errors. Based on the latter assumption, an intuitive non-parametric post-hoc calibration method is proposed, which is shown to offer improvements to the state-of-the-art according to the expected calibration error metric on CIFAR-10 and CIFAR-100 data sets.
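
For reference, a common way to compute the expected calibration error metric mentioned above (equal-width confidence bins; the bin count is an assumption):

    import numpy as np

    def expected_calibration_error(probs, labels, n_bins=15):
        # probs: [N, C] predicted probabilities; labels: [N] true class indices.
        probs, labels = np.asarray(probs), np.asarray(labels)
        conf = probs.max(axis=1)
        acc = (probs.argmax(axis=1) == labels).astype(float)
        bins = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for lo, hi in zip(bins[:-1], bins[1:]):
            mask = (conf > lo) & (conf <= hi)
            if mask.any():
                ece += mask.mean() * abs(acc[mask].mean() - conf[mask].mean())
        return ece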


URL: https://openreview.net/forum?id=na5sHG69rI

---

Title: Learning Graph Structure from Convolutional Mixtures

Authors: Max Wasserman, Saurabh Sihag, Gonzalo Mateos, Alejandro Ribeiro

Abstract: Machine learning frameworks such as graph neural networks typically rely on a given, fixed graph to exploit relational inductive biases and thus effectively learn from network data. However, when said graphs are (partially) unobserved, noisy, or dynamic, the problem of inferring graph structure from data becomes relevant. In this paper, we postulate a graph convolutional relationship between the observed and latent graphs, and formulate the graph structure learning task as a network inverse (deconvolution) problem. In lieu of eigendecomposition-based spectral methods or iterative optimization solutions, we unroll and truncate proximal gradient iterations to arrive at a parameterized neural network architecture that we call a Graph Deconvolution Network (GDN). GDNs can learn a distribution of graphs in a supervised fashion, perform link prediction or edge-weight regression tasks by adapting the loss function, and they are inherently inductive as well as node permutation equivariant. We corroborate GDN's superior graph learning performance and its generalization to larger graphs using synthetic data in supervised settings. Moreover, we demonstrate the robustness and representation power of GDNs on real world neuroimaging and social network datasets.
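 
The unrolling idea can be illustrated on the textbook sparse-coding case (a LISTA-style network); this is only a generic sketch of "unroll and truncate proximal gradient iterations", not the GDN's graph-deconvolution operator:

    import torch
    import torch.nn as nn

    class UnrolledISTA(nn.Module):
        """Truncated proximal-gradient iterations unrolled into learnable layers."""
        def __init__(self, d_in, d_code, n_layers=8):
            super().__init__()
            self.We = nn.Linear(d_in, d_code, bias=False)    # learned encoder step
            self.S = nn.Linear(d_code, d_code, bias=False)   # learned recurrence
            self.theta = nn.Parameter(torch.full((n_layers,), 0.1))  # per-layer thresholds
            self.n_layers = n_layers

        @staticmethod
        def soft_threshold(x, t):
            return torch.sign(x) * torch.relu(torch.abs(x) - t)   # prox of the l1 norm

        def forward(self, y):
            x = self.soft_threshold(self.We(y), self.theta[0])
            for k in range(1, self.n_layers):
                x = self.soft_threshold(self.We(y) + self.S(x), self.theta[k])
            return x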

URL: https://openreview.net/forum?id=OILbP0WErR

---

Title: Learning Object-Centric Neural Scattering Functions for Free-viewpoint Relighting and Scene Composition

Authors: Hong-Xing Yu, Michelle Guo, Alireza Fathi, Yen-Yu Chang, Eric Ryan Chan, Ruohan Gao, Thomas Funkhouser, Jiajun Wu

Abstract: Photorealistic object appearance modeling from 2D images is a constant topic in vision and graphics. While neural implicit methods (such as Neural Radiance Fields) have shown high-fidelity view synthesis results, they cannot relight the captured objects. More recent neural inverse rendering approaches have enabled object relighting, but they represent surface properties as simple BRDFs, and therefore cannot handle translucent objects. We propose Object-Centric Neural Scattering Functions (OSFs) for learning to reconstruct object appearance from only images. OSFs not only support free-viewpoint object relighting, but also can model both opaque and translucent objects. While accurately modeling subsurface light transport for translucent objects can be highly complex and even intractable for neural methods, OSFs learn to approximate the radiance transfer from a distant light to an outgoing direction at any spatial location. This approximation avoids explicitly modeling complex subsurface scattering, making learning a neural implicit model tractable. Experiments on real and synthetic data show that OSFs accurately reconstruct appearances for both opaque and translucent objects, allowing faithful free-viewpoint relighting as well as scene composition. In our supplementary material, we include a video for an overview. Project website with video results: https://kovenyu.com/OSF/

URL: https://openreview.net/forum?id=NrfSRtTpN5

---

Title: How to Reuse and Compose Knowledge for a Lifetime of Tasks: A Survey on Continual Learning and Functional Composition

Authors: Jorge A Mendez, ERIC EATON

Abstract: A major goal of artificial intelligence (AI) is to create an agent capable of acquiring a general understanding of the world. Such an agent would require the ability to continually accumulate and build upon its knowledge as it encounters new experiences. Lifelong or continual learning addresses this setting, whereby an agent faces a continual stream of problems and must strive to capture the knowledge necessary for solving each new task it encounters. If the agent is capable of accumulating knowledge in some form of compositional representation, it could then selectively reuse and combine relevant pieces of knowledge to construct novel solutions. Despite the intuitive appeal of this simple idea, the literatures on lifelong learning and compositional learning have proceeded largely separately. In an effort to promote developments that bridge between the two fields, this article surveys their respective research landscapes and discusses existing and future connections between them.

URL: https://openreview.net/forum?id=VynY6Bk03b

---

Title: Contextualize Me – The Case for Context in Reinforcement Learning

Authors: Carolin Benjamins, Theresa Eimer, Frederik Schubert, Aditya Mohan, Sebastian Döhler, André Biedenkapp, Bodo Rosenhahn, Frank Hutter, Marius Lindauer

Abstract: While Reinforcement Learning (RL) has made great strides towards solving increasingly complicated problems, many algorithms are still brittle to even slight environmental changes. Contextual Reinforcement Learning (cRL) provides a framework to model such changes in a principled manner, thereby enabling flexible, precise and interpretable task specification and generation. Our goal is to show how the framework of cRL contributes to improving zero-shot generalization in RL through meaningful benchmarks and structured reasoning about generalization tasks. We confirm the insight that optimal behavior in cRL requires context information, as in other related areas of partial observability. To empirically validate this in the cRL framework, we provide various context-extended versions of common RL environments. They are part of the first benchmark library, CARL, designed for generalization based on cRL extensions of popular benchmarks, which we propose as a testbed to further study general agents. We show that in the contextual setting, even simple RL environments become challenging, and that naive solutions are not enough to generalize across complex context spaces.

URL: https://openreview.net/forum?id=Y42xVBQusn

---

Title: Multi-dimensional concept discovery (MCD): A unifying framework with completeness guarantees

Authors: Johanna Vielhaben, Stefan Bluecher, Nils Strodthoff

Abstract: The completeness axiom renders the explanation of a post-hoc eXplainable AI (XAI) method only locally faithful to the model, i.e. for a single decision. For the trustworthy application of XAI, in particular for high-stake decisions, a more global model understanding is required. To this end, concept-based methods have been proposed, which are however not guaranteed to be bound to the actual model reasoning. To circumvent this problem, we propose Multi-dimensional Concept Discovery (MCD) as an extension of previous approaches that fulfills a completeness relation on the level of concepts. Our method starts from general linear subspaces as concepts and does neither require reinforcing concept interpretability nor re-training of model parts. We propose sparse subspace clustering to discover improved concepts and fully leverage the potential of multi-dimensional subspaces. MCD offers two complementary analysis tools for concepts in input space: (1) concept activation maps, that show where a concept is expressed within a sample, allowing for concept characterization through prototypical samples, and (2) concept relevance heatmaps, that decompose the model decision into concept contributions. Both tools together enable a detailed global understanding of the model reasoning, which is guaranteed to relate to the model via a completeness relation. Thus, MCD paves the way towards more trustworthy concept-based XAI. We empirically demonstrate the superiority of MCD against more constrained concept definitions.

URL: https://openreview.net/forum?id=KxBQPz7HKh

---

Title: Dr-Fairness: Dynamic Data Ratio Adjustment for Fair Training on Real and Generated Data

Authors: Yuji Roh, Weili Nie, De-An Huang, Steven Euijong Whang, Arash Vahdat, Anima Anandkumar

Abstract: Fair visual recognition has become critical for preventing demographic disparity. A major cause of model unfairness is the imbalanced representation of different groups in training data. Recently, several works aim to alleviate this issue using generated data. However, these approaches often use generated data to obtain similar amounts of data across groups, which is not optimal for achieving high fairness due to different learning difficulties and generated data qualities across groups. To address this issue, we propose a novel adaptive sampling approach that leverages both real and generated data for fairness. We design a bilevel optimization that finds the optimal data sampling ratios among groups and between real and generated data while training a model. The ratios are dynamically adjusted considering both the model's accuracy as well as its fairness. To efficiently solve our non-convex bilevel optimization, we propose a simple approximation to the solution given by the implicit function theorem. Extensive experiments show that our framework achieves state-of-the-art fairness and accuracy on the CelebA and ImageNet People Subtree datasets. We also observe that our method adaptively relies less on the generated data when it has poor quality. Our work shows the importance of using generated data together with real data for improving model fairness.

URL: https://openreview.net/forum?id=TyBd56VK7z

---


New submissions
===============


Title: Regret Bounds for Satisficing in Multi-Armed Bandit Problems

Abstract: This paper considers the objective of \textit{satisficing} in multi-armed bandit problems. Instead of aiming to find an optimal arm, the learner is content with an arm whose reward is above a given satisfaction level. We provide algorithms and analysis both for the realizable case, in which such a satisficing arm exists, and for the general case, in which it may not. Introducing the notion of \textit{satisficing regret}, our main result shows that in the general case it is possible to obtain constant satisficing regret when there is a satisficing arm (thereby correcting a contrary claim in the literature), while standard logarithmic regret bounds can be re-established otherwise. Experiments illustrate that our algorithm is not only superior to standard algorithms in the satisficing setting, but also works well in the classic bandit setting.


URL: https://openreview.net/forum?id=QnT41ZGNh9

---

Title: Faithful Knowledge Distillation

Abstract: Knowledge distillation (KD) has received much attention due to its success in compressing networks to allow for their deployment in resource-constrained systems. While the problem of adversarial robustness has been studied before in the KD setting, previous works overlook what we term the relative calibration of the student network with respect to its teacher in terms of soft confidences. In particular, we focus on two crucial questions with regard to a teacher-student pair: (i) do the teacher and student disagree at points close to correctly classified dataset examples, and (ii) is the distilled student as confident as the teacher around dataset examples? These are critical questions when considering the deployment of a smaller student network trained from a robust teacher within a safety-critical setting. To address these questions, we introduce a faithful imitation framework to discuss the relative calibration of confidences, as well as provide empirical and certified methods to evaluate the relative calibration of a student w.r.t. its teacher. Further, to verifiably align the relative calibration incentives of the student to those of its teacher, we introduce faithful distillation. Our experiments on the MNIST and Fashion-MNIST datasets demonstrate the need for such an analysis and the advantages of the increased verifiability of faithful distillation over alternative adversarial distillation methods.

URL: https://openreview.net/forum?id=IlBSr94R3j

---

Title: Learning to Boost Resilience of Complex Networks via Neural Edge Rewiring

Abstract: The resilience of complex networks refers to their ability to maintain functionality in the face of structural attacks. This ability can be improved by performing minimal modifications to the network structure via degree-preserving edge rewiring-based methods. Existing learning-free edge rewiring methods, although effective, are limited in their ability to generalize to different graphs. Such a limitation cannot be trivially addressed by existing graph neural network (GNN)-based learning approaches, since there are no rich initial node features from which GNNs can learn meaningful representations. In this work, inspired by persistent homology, we design a GNN variant called FireGNN that learns meaningful node representations solely from graph structures. We then develop an end-to-end inductive method called ResiNet, which aims to discover resilient network topologies while balancing network utility. ResiNet reformulates the optimization of network resilience as a Markov decision process equipped with an edge rewiring action space. It learns to sequentially select the appropriate edges to rewire for maximizing resilience. Extensive experiments demonstrate that ResiNet outperforms existing approaches and achieves near-optimal resilience gains on various graphs while balancing network utility.

URL: https://openreview.net/forum?id=moZvOx5cxe

---

Title: Disciplined Saddle Programming

Abstract: We consider convex-concave saddle point problems, and more generally convex optimization problems we refer to as saddle problems, which include the partial supremum or infimum of convex-concave saddle functions. Saddle problems arise in a wide range of applications, including game theory, machine learning, and finance. It is well known that a saddle problem can be reduced to a single convex optimization problem by dualizing either the convex (min) or concave (max) objectives, reducing a min-max problem into a min-min (or max-max) problem. Carrying out this conversion by hand can be tedious and error prone. In this paper we introduce disciplined saddle programming (DSP), a domain specific language (DSL) for specifying saddle problems, for which the dualizing trick can be automated. The language and methods are based on recent work by Juditsky and Nemirovski, who developed the idea of conic-representable saddle point programs, and showed how to carry out the required dualization automatically using conic duality. Juditsky and Nemirovski's conic representation of saddle problems extends Nesterov and Nemirovski's earlier development of conic representable convex problems; DSP can be thought of as extending disciplined convex programming (DCP) to saddle problems. Just as DCP makes it easy for users to formulate and solve complex convex problems, DSP allows users to easily formulate and solve saddle problems. Our method is implemented in an open-source package, also called DSP.

URL: https://openreview.net/forum?id=KhMLfEIoUm

---

Title: Finding Neurons in a Haystack: Case Studies with Sparse Probing

Abstract: Despite rapid adoption and deployment of large language models (LLMs), the internal computations of these models remain opaque and poorly understood. In this work, we seek to understand how high-level human-interpretable features are represented within the internal neuron activations of LLMs. We train $k$-sparse linear classifiers (probes) on these internal activations to predict the presence of features in the input; by varying the value of $k$ we study the sparsity of learned representations and how this varies with model scale. With $k=1$, we localize individual neurons that are highly relevant for a particular feature and perform a number of case studies to illustrate general properties of LLMs. In particular, we show that early layers make use of sparse combinations of neurons to represent many features in superposition, that middle layers have seemingly dedicated neurons to represent higher-level contextual features, and that increasing scale causes representational sparsity to increase on average, but there are multiple types of scaling dynamics.
In all, we probe for over 100 unique features comprising 10 different categories in 7 different models spanning 70 million to 6.9 billion parameters.
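
A hedged sketch of what a k=1 sparse probe might look like (the selection rule and variable names are illustrative assumptions, not the authors' pipeline):

    # Pick the single activation dimension most correlated with a binary feature,
    # then fit a one-dimensional logistic classifier on that neuron alone.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def k1_probe(acts, feature):
        # acts: [N, D] neuron activations; feature: [N] binary labels (0/1).
        acts, feature = np.asarray(acts), np.asarray(feature)
        centered = acts - acts.mean(axis=0)
        target = feature - feature.mean()
        corr = np.abs(centered.T @ target) / (
            np.linalg.norm(centered, axis=0) * np.linalg.norm(target) + 1e-12)
        best = int(np.argmax(corr))                     # candidate "feature neuron"
        clf = LogisticRegression().fit(acts[:, [best]], feature)
        return best, clf.score(acts[:, [best]], feature)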

URL: https://openreview.net/forum?id=JYs1R9IMJr

---

Title: Cross-client Label Propagation for Transductive and Semi-Supervised Federated Learning

Abstract: We present Cross-Client Label Propagation (XCLP), a new method for transductive and semi-supervised federated learning. XCLP estimates a data graph jointly from the data of multiple clients and computes labels for the unlabeled data by propagating label information across the graph. To avoid clients having to share their data with anyone, XCLP employs two cryptographically secure protocols: secure Hamming distance computation and secure summation. We demonstrate two distinct applications of XCLP within federated learning. In the first, we use it in a one-shot way to predict labels for unseen test points. In the second, we use it to repeatedly pseudo-label unlabeled training data in a federated semi-supervised setting. Experiments on both real federated and standard benchmark datasets show that in both applications XCLP achieves higher classification accuracy than alternative approaches.
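
For orientation, plain (non-federated, non-private) label propagation on an affinity graph, the primitive that XCLP computes securely across clients, can be sketched as:

    import numpy as np

    def label_propagation(W, Y, alpha=0.9, n_iter=50):
        # W: [N, N] symmetric affinity matrix; Y: [N, C] one-hot rows (all-zero if unlabeled).
        d = W.sum(axis=1)
        S = W / np.sqrt(np.outer(d, d) + 1e-12)     # symmetric normalization D^-1/2 W D^-1/2
        F = Y.astype(float).copy()
        for _ in range(n_iter):
            F = alpha * (S @ F) + (1 - alpha) * Y   # propagate, anchored at known labels
        return F.argmax(axis=1)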

URL: https://openreview.net/forum?id=gY04GX8R5k

---

Title: Inducing Reusable Skills From Demonstrations with Option-Controller Network

Abstract: Humans can decompose previous experiences into skills and reuse them to enable fast learning in the future. Inspired by this process, we propose a new model called Option-Controller Network (OCN), which is a bi-level recurrent policy network composed of a high-level controller and a pool of low-level options. The options are disconnected from any task-specific information to model task-agnostic skills. And the controller uses options to solve a given task. With the isolation of information and the synchronous calling mechanism, we can impose a division of work between the controller and options in an end-to-end training regime. In experiments, we first perform behavior cloning from unstructured demonstrations of different tasks. We then freeze the learned options and learn a new controller to solve a new task. Extensive results on discrete and continuous environments show that OCN can jointly learn to decompose unstructured demonstrations into skills and model each skill with separate options. The learned options provide a good temporal abstraction, allowing OCN to quickly transfer to tasks with a novel combination of learned skills even with sparse reward, while previous methods suffer from the delayed reward problem due to the lack of temporal abstraction or a complicated option-controlling mechanism.

URL: https://openreview.net/forum?id=t65EAJNeI1

---

Title: Uncertainty Estimation for Computed Tomography with a Linearised Deep Image Prior

Abstract: Existing deep-learning-based tomographic image reconstruction methods do not provide accurate estimates of reconstruction uncertainty, hindering their real-world deployment. This paper develops a method, termed the linearised deep image prior (DIP), to estimate the uncertainty associated with reconstructions produced by the DIP with total variation regularisation (TV). Specifically, we endow the DIP with conjugate Gaussian-linear model type error-bars computed from a local linearisation of the neural network around its optimised parameters. To preserve conjugacy, we approximate the TV regulariser with a Gaussian surrogate. This approach provides pixel-wise uncertainty estimates and a marginal likelihood objective for hyperparameter optimisation. We demonstrate the method on synthetic data and real-measured high-resolution 2D mu-CT data, and show that it provides superior calibration of uncertainty estimates relative to previous probabilistic formulations of the~DIP. Our code is available at https://github.com/anonymooseBayesDIP/bayes_dip.

URL: https://openreview.net/forum?id=FWyabz82fH

---

Title: Granger-Causal Hierarchical Skill Discovery

Abstract: Reinforcement Learning (RL) has shown promising results learning policies for complex tasks, but can often suffer from low sample efficiency and limited transfer. We introduce the Hierarchy of Interaction Skills (HIntS) algorithm, which uses learned interaction detectors to discover and train a hierarchy of skills that manipulate factors in factored environments. Inspired by Granger causality, these unsupervised detectors capture key events between factors to learn useful skills sample-efficiently and to transfer those skills to other related tasks---tasks where many reinforcement learning techniques struggle. We evaluate HIntS on a robotic pushing task with obstacles---a challenging domain where other RL and HRL methods fall short. The learned skills not only demonstrate transfer using variants of Breakout, a common RL benchmark, but also show 2-3x improvement in both sample efficiency and final performance compared to comparable RL baselines. Together, HIntS demonstrates a proof of concept for using Granger-causal relationships for skill discovery.

URL: https://openreview.net/forum?id=rRiwLPHhcK

---

Title: On the Sample Complexity of Lipschitz Learning Algorithms

Abstract: Estimating the Lipschitz constant of a function, also known as Lipschitz learning, is a fundamental problem with broad applications in fields such as control and global optimization. In this paper, we study the Lipschitz learning problem with minimal parametric assumptions on the target function. As a first theoretical contribution, we derive novel lower bounds on the sample complexity of this problem for both noise-free and noisy settings under mild assumptions. Moreover, we propose a simple Lipschitz learning algorithm called \textit{Lipschitz Constant Estimation by Least Squares Regression} (referred to as LCLS). We show that LCLS is asymptotically consistent and offers finite sample guarantees that can be translated to new upper bounds on the sample complexity of the Lipschitz learning problem. Our analysis shows that the sample complexity of LCLS is optimal in the general noise-free setting. Furthermore, we show that by design, the LCLS algorithm is computationally faster than existing theoretically consistent methods, and can be readily adapted to various noise assumptions with little to no prior knowledge of the target function properties or noise distribution.

URL: https://openreview.net/forum?id=UIalYAHdBH

---

Title: CAREER: Transfer Learning for Economic Prediction of Labor Sequence Data

Abstract: Labor economists regularly analyze employment data by fitting predictive models to small, carefully constructed longitudinal survey datasets. Although modern machine learning methods offer promise for such problems, these survey datasets are too small to take advantage of them. In recent years large datasets of online resumes have also become available, providing data about the career trajectories of millions of individuals. However, standard econometric models cannot take advantage of their scale or incorporate them into the analysis of survey data. To this end we develop CAREER, a transformer-based model that uses transfer learning to learn representations of job sequences. CAREER is first fit to large, passively-collected resume data and then fine-tuned to smaller, better-curated datasets for economic inferences. We fit CAREER to a dataset of 24 million job sequences from resumes, and adjust its representations on longitudinal survey datasets. We find that CAREER forms accurate predictions of job sequences, achieving state-of-the-art predictive performance on three widely-used economics datasets. We further find that CAREER can be used to form good predictions of other downstream variables. For example, incorporating CAREER into a wage model provides better predictions than the econometric models currently in use.

URL: https://openreview.net/forum?id=4i1MXH8Sle

---

Title: ILPO-MP: Mode Priors Prevent Mode Collapse when Imitating Latent Policies from Observations

Abstract: Imitation learning from observations (IfO) constrains the classic imitation learning setting to cases where expert observations are easy to obtain, but no expert actions are available. Most existing IfO methods require access to task-specific cost functions or many interactions with the target environment. Learning a forward dynamics model in combination with a latent policy has been shown to solve these issues. However, the limited supervision in the IfO scenario can lead to mode collapse when learning the generative forward dynamics model and the corresponding latent policy. In this paper, we analyze the mode collapse problem in this setting and show that it is caused by a combination of deterministic expert data and bad initialization of the models. Under the assumption of piecewise continuous system dynamics, we propose ILPO-MP, a method to prevent the mode collapse using clustering of expert transitions to impose a mode prior on the generative model and the latent policy. We show that ILPO-MP prevents mode collapse and improves performance in a variety of environments.

URL: https://openreview.net/forum?id=f3JLnnZsAm

---

Title: DCP: Learning Accelerator Dataflow for Neural Network via Propagation

Abstract: Deep neural network (DNN) hardware (HW) accelerators have achieved great success in improving DNNs' performance and efficiency. One key reason is the dataflow used in executing a DNN layer, including on-chip data partitioning, computation parallelism, and scheduling policy, which have large impacts on latency and energy consumption. Unlike prior works that required considerable effort from HW engineers to design suitable dataflows for different DNNs, this work proposes an efficient data-centric approach, named DCP, to automatically find the optimal dataflow for DNN layers in seconds without human effort. It has several attractive benefits that prior arts do not have. (i) We translate the HW dataflow configuration into a code representation in a unified dataflow coding space, which can be optimized by back-propagating gradients given a DNN layer or network. (ii) DCP learns a neural predictor to efficiently update the dataflow codes towards the desired gradient directions to minimize various optimization objectives (e.g., latency and energy). (iii) It can be easily generalized to unseen HW configurations in a zero-shot or few-shot learning manner. For example, without using additional training data, DCP surpasses the GAMMA method that performs a full search using thousands of samples. Extensive experiments on several representative models such as MobileNet, ResNet, and ViT show that DCP outperforms its counterparts in various settings.

URL: https://openreview.net/forum?id=BNIrElhRl4

---

Title: Find Your Friends: Personalized Federated Learning with the Right Collaborators

Abstract: In the traditional federated learning setting, a central server coordinates a network of clients to train one global model. However, the global model may serve many clients poorly due to data heterogeneity. This problem can be mitigated when participating clients learn personalized models that can better serve their own needs. By noting that each client’s distribution can be represented as a mixture of all clients’ distributions, we derive a principled algorithm based on expectation maximization. Our framework, FedeRiCo, estimates the utilities of other participants’ models on each client’s data so that everyone can select the right collaborators for learning. As a result, each client can learn as much or as little from other clients as is optimal for its local data distribution. Additionally, we theoretically analyze the convergence of FedeRiCo and empirically demonstrate its communication efficiency even in the fully decentralized setting. Our algorithm outperforms other federated, personalized, and/or decentralized approaches on several benchmark datasets, being the only approach that consistently performs better than training with local data alone.

URL: https://openreview.net/forum?id=GCFOJfYnCn

---

Title: Bayesian Optimization on the Cartesian Product of Weighted Graphs to Better Search Discrete Spaces with Irregular Increments

Abstract: Bayesian optimization is a powerful tool for optimizing a black-box function on a compact Euclidean space under a limited evaluation budget. However, in practice, we may want to optimize over discretization of the solution space. For example, in scientific and engineering problems the discretization of the solution space naturally occurs due to measurement precision or standardized parts. In this work, we consider the problem of optimizing a black-box function with a discretized solution space. To address this problem, prior work uses Bayesian optimization on the Cartesian product of graphs. We extend this work to weighted edges which allow us to exploit the problem structure more effectively. Our proposed method outperforms earlier methods in diverse experiments including neural architecture search benchmarks and physics-based simulations with discretized solution spaces. We also investigate the impact of adding multi-hop edges to weighted graphs, which improves performance of our method on the optimization of synthetic benchmark functions.

URL: https://openreview.net/forum?id=DPptk6FSAi

---

Title: Understanding convolution on graphs via energies

Abstract: Graph Neural Networks (GNNs) typically operate by message-passing, where the state of a node is updated based on the information received from its neighbours. Most message-passing models act as graph convolutions, where features are mixed by a shared, linear transformation before being propagated over the edges. On node-classification tasks, graph convolutions have been shown to suffer from two limitations: poor performance on heterophilic graphs, and over-smoothing. It is common belief that both phenomena occur because such models behave as low-pass filters, meaning that the Dirichlet energy of the features decreases along the layers incurring a smoothing effect that ultimately makes features no longer distinguishable. In this work, we rigorously prove that simple graph-convolutional models can actually enhance high frequencies and even lead to an asymptotic behaviour we refer to as over-sharpening, opposite to over-smoothing. We do so by showing that linear graph convolutions with symmetric weights minimize a multi-particle energy that generalizes the Dirichlet energy; in this setting, the weight matrices induce edge-wise attraction (repulsion) through their positive (negative) eigenvalues, thereby controlling whether the features are being smoothed or sharpened. We also extend the analysis to non-linear GNNs, and demonstrate that some existing time-continuous GNNs are instead always dominated by the low frequencies. Finally, we validate our theoretical findings through ablations and real-world experiments.
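
The Dirichlet energy referred to above can be computed directly; a small sketch with the combinatorial Laplacian L = D - A (normalized variants are also common), so that E(X) = trace(X^T L X) measures how much neighbouring node features differ:

    import numpy as np

    def dirichlet_energy(X, A):
        # X: [N, d] node features; A: [N, N] adjacency (assumed symmetric).
        X, A = np.asarray(X), np.asarray(A)
        L = np.diag(A.sum(axis=1)) - A
        return float(np.trace(X.T @ L @ X))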

URL: https://openreview.net/forum?id=v5ew3FPTgb

---

Title: Achieving Risk Control in Online Learning Settings

Abstract: To provide rigorous uncertainty quantification for online learning models, we develop a framework for constructing uncertainty sets that provably control risk---such as coverage of confidence intervals, false negative rate, or F1 score---in the online setting. This extends conformal prediction to apply to a larger class of online learning problems. Our method guarantees risk control at any user-specified level even when the underlying data distribution shifts drastically, even adversarially, over time in an unknown fashion.
The technique we propose is highly flexible as it can be applied with any base online learning algorithm (e.g., a deep neural network trained online), requiring minimal implementation effort and essentially zero additional computational cost.
We further extend our approach to control multiple risks simultaneously, so the prediction sets we generate are valid for all given risks.
To demonstrate the utility of our method, we conduct experiments on real-world tabular time-series data sets showing that the proposed method rigorously controls various natural risks.
Furthermore, we show how to construct valid intervals for an online image-depth estimation problem that previous sequential calibration schemes cannot handle.
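
A simple online calibration update in this spirit (adaptive, stochastic-approximation style; an illustration only, not the paper's exact algorithm) adjusts a threshold based on whether the current prediction set incurred an error:

    def online_threshold_update(theta, covered, target_risk=0.1, lr=0.05):
        # theta: current calibration threshold; covered: bool, did the prediction
        # set avoid an error on the latest observation. The threshold grows after
        # an error and shrinks otherwise, tracking the target risk over time.
        err = 0.0 if covered else 1.0
        return theta + lr * (err - target_risk)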

URL: https://openreview.net/forum?id=5Y04GWvoJu

---

Title: On the special role of class-selective neurons in early training

Abstract: It is commonly observed that deep networks trained for classification exhibit class-selective neurons in their early and intermediate layers. Intriguingly, recent studies have shown that these class-selective neurons can be ablated without deteriorating network function. But if class-selective neurons are not necessary, why do they exist? We attempt to answer this question in a series of experiments on ResNet-50s trained on ImageNet. We first show that class-selective neurons emerge during the first few epochs of training, before receding rapidly but not completely; this suggests that class-selective neurons found in trained networks are in fact vestigial remains of early training. With single-neuron ablation experiments, we then show that class-selective neurons are important for network function in this early phase of training. We also observe that the network is close to a linear regime in this early phase; we thus speculate that class-selective neurons appear early in training as quasi-linear shortcut solutions to the classification task. Finally, in causal experiments where we regularize against class selectivity at different points in training, we show that the presence of class-selective neurons early in training is critical to the successful training of the network; in contrast, class-selective neurons can be suppressed later in training with little effect on final accuracy. It remains to be understood by which mechanism the presence of class-selective neurons in the early phase of training contributes to the successful training of networks.

URL: https://openreview.net/forum?id=JaNlH6dZYk

---

Title: Meta-Learning an Approximate Inference Algorithm for Low-Level Probabilistic Programs

Abstract: We present a meta-algorithm for learning an approximate posterior-inference algorithm for low-level probabilistic programs that terminate. Our meta-algorithm takes a training set of probabilistic programs that describe models with observations, and attempts to learn an efficient method for inferring the posterior of a similar program. A key feature of our approach is the use of what we call a white-box inference algorithm that extracts information directly from model descriptions themselves, given as programs. Concretely, our white-box inference algorithm is equipped with multiple neural networks, one for each type of atomic command, and computes an approximate posterior of a given probabilistic program by analysing individual atomic commands in the program using these networks. The parameters of the networks are learnt from a training set of programs by our meta-algorithm. We empirically demonstrate that the learnt inference algorithm generalises well to programs that are new in terms of both parameters and model structures, and report important use cases where our approach, in combination with importance sampling (IS), achieves greater test-time efficiency than alternatives such as HMC. The overall results show the promise of our approach as well as the remaining challenges.

URL: https://openreview.net/forum?id=wsMFmhGDyp

---

Title: Optimistic Optimization of Gaussian Process Samples

Abstract: Bayesian optimization is a popular formalism for global optimization, but its computational costs limit it to expensive-to-evaluate functions. A competing, computationally more efficient, global optimization framework is optimistic optimization, which exploits prior knowledge about the geometry of the search space in the form of a dissimilarity function. We investigate to which degree the conceptual advantages of Bayesian optimization can be combined with the computational efficiency of optimistic optimization. By mapping the kernel to a dissimilarity, we obtain an optimistic optimization algorithm for the Bayesian optimization setting with a run-time of up to $O(N \log N)$. As a high-level take-away we find that, when using stationary kernels on objectives of low evaluation cost, optimistic optimization can be preferable over Bayesian optimization, while for strongly coupled and parametric models, Bayesian optimization can perform much better, even at low evaluation cost. As a conceptual take-away, our results demonstrate that balancing exploration and exploitation under Gaussian process assumptions does not require computing a posterior.
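
The kernel-to-dissimilarity mapping has a standard form: the canonical metric induced by a kernel k is d(x, y) = sqrt(k(x,x) + k(y,y) - 2 k(x,y)). A tiny sketch with an RBF kernel (the bandwidth is an illustrative assumption):

    import numpy as np

    def rbf_kernel(x, y, lengthscale=1.0):
        return np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2) / (2 * lengthscale ** 2))

    def kernel_dissimilarity(x, y, k=rbf_kernel):
        # Distance in the kernel's feature space; clipped at 0 for numerical safety.
        return np.sqrt(max(k(x, x) + k(y, y) - 2 * k(x, y), 0.0))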

URL: https://openreview.net/forum?id=KQ5jI19kF3

---

Title: The NTK approximation is valid for longer than you think

Abstract: We study when the neural tangent kernel (NTK) approximation is valid for training a model with the square loss. In the lazy training setting of Chizat et al. 2019, we show that rescaling the model by a factor of $\alpha = O(T)$ suffices for the NTK approximation to be valid until training time $T$. Our bound is tight and improves on the previous bound of Chizat et al. 2019, which required a larger rescaling factor of $\alpha = O(T^2)$.

URL: https://openreview.net/forum?id=qM7JPBYROr

---

Title: Discretization Invariant Networks for Learning Maps between Neural Fields

Abstract: With the emergence of powerful representations of continuous data in the form of neural fields, there is a need for discretization invariant learning: an approach for learning maps between functions on continuous domains without being sensitive to how the function is sampled. We present a new framework for understanding and designing discretization invariant neural networks (DI-Nets), which generalizes many discrete networks such as convolutional neural networks as well as continuous networks such as neural operators. Our analysis establishes upper bounds on the deviation in model outputs under different finite discretizations, and highlights the central role of point set discrepancy in characterizing such bounds. This insight leads to the design of a family of neural networks driven by numerical integration via quasi-Monte Carlo sampling with discretizations of low discrepancy. We prove by construction that DI-Nets universally approximate a large class of maps between integrable function spaces, and show that discretization invariance also describes backpropagation through such models. Applied to neural fields, convolutional DI-Nets can learn to classify and segment visual data under various discretizations, and sometimes generalize to new types of discretizations at test time.
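
The low-discrepancy sampling ingredient can be illustrated with quasi-Monte Carlo integration via scrambled Sobol points (a generic sketch, not the DI-Net architecture itself):

    import numpy as np
    from scipy.stats import qmc

    def qmc_integrate(f, d, m=10, seed=0):
        # Average f over 2**m scrambled Sobol points in [0, 1]^d; the low discrepancy
        # of the point set is what controls the integration (and discretization) error.
        points = qmc.Sobol(d=d, scramble=True, seed=seed).random_base2(m=m)
        return float(np.mean([f(p) for p in points]))

    # Example: the integral of prod(x) over the unit square is 1/4.
    print(qmc_integrate(lambda p: np.prod(p), d=2))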

URL: https://openreview.net/forum?id=CpYBAqDgmz

---
