Weekly TMLR digest for Jun 18, 2023

TMLR

Jun 17, 2023, 8:00:10 PM

to tmlr-annou...@googlegroups.com

New certifications
==================

Featured Certification: LEAD: Min-Max Optimization from a Physical Perspective

Reyhane Askari Hemmat, Amartya Mitra, Guillaume Lajoie, Ioannis Mitliagkas

https://openreview.net/forum?id=vXSsTYs6ZB

---

Survey Certification: On Averaging ROC Curves

Jack Hogan, Niall M. Adams

https://openreview.net/forum?id=FByH3qL87G

---

Accepted papers
===============

Title: Empirical Study on Optimizer Selection for Out-of-Distribution Generalization

Authors: Hiroki Naganuma, Kartik Ahuja, Shiro Takagi, Tetsuya Motokawa, Rio Yokota, Kohta Ishikawa, Ikuro Sato, Ioannis Mitliagkas

Abstract: Modern deep learning systems do not generalize well when the test data distribution is slightly different to the training data distribution. While much promising work has been accomplished to address this fragility, a systematic study of the role of optimizers and their out-of-distribution generalization performance has not been undertaken. In this study, we examine the performance of popular first-order optimizers for different classes of distributional shift under empirical risk minimization and invariant risk minimization. We address this question for image and text classification using DomainBed, WILDS, and Backgrounds Challenge as testbeds for studying different types of shifts, namely correlation and diversity shift. We search over a wide range of hyperparameters and examine classification accuracy (in-distribution and out-of-distribution) for over 20,000 models. We arrive at the following findings, which we expect to be helpful for practitioners: i) adaptive optimizers (e.g., Adam) perform worse than non-adaptive optimizers (e.g., SGD, momentum SGD) on out-of-distribution performance. In particular, even though there is no significant difference in in-distribution performance, we show a measurable difference in out-of-distribution performance. ii) in-distribution performance and out-of-distribution performance exhibit three types of behavior depending on the dataset: linear returns, increasing returns, and diminishing returns. For example, in the training of natural language data using Adam, fine-tuning in-distribution performance does not significantly contribute to the out-of-distribution generalization performance.

URL: https://openreview.net/forum?id=ipe0IMglFF

---

Title: The Eigenlearning Framework: A Conservation Law Perspective on Kernel Ridge Regression and Wide Neural Networks

Authors: James B Simon, Madeline Dickens, Dhruva Karkada, Michael Deweese

Abstract: We derive simple closed-form estimates for the test risk and other generalization metrics of kernel ridge regression (KRR). Relative to prior work, our derivations are greatly simplified and our final expressions are more readily interpreted. In particular, we show that KRR can be interpreted as an explicit competition among kernel eigenmodes for a fixed supply of a quantity we term "learnability." These improvements are enabled by a sharp conservation law which limits the ability of KRR to learn any orthonormal basis of functions. Test risk and other objects of interest are expressed transparently in terms of our conserved quantity evaluated in the kernel eigenbasis. We use our improved framework to:
i) provide a theoretical explanation for the "deep bootstrap" of Nakkiran et al. (2020),
ii) generalize a previous result regarding the hardness of the classic parity problem,
iii) fashion a theoretical tool for the study of adversarial robustness, and
iv) draw a tight analogy between KRR and a well-studied system in statistical physics.

URL: https://openreview.net/forum?id=FDbQGCAViI

---

Title: Unsupervised Domain Adaptation via Minimized Joint Error

Authors: Dexuan Zhang, Thomas Westfechtel, Tatsuya Harada

Abstract: Unsupervised domain adaptation transfers knowledge from a fully labeled source domain to a different target domain, where no labeled data are available. Some researchers have proposed upper bounds for the target error when transferring knowledge. For example, Ben-David et al. (2010) established a theory based on minimizing the source error and distance between marginal distributions simultaneously. However, in most research, the joint error is ignored because of its intractability. In this research, we argue that joint errors are essential for domain adaptation problems, particularly when the domain gap is large. To address this problem, we propose a novel objective related to the upper bound of the joint error. Moreover, we adopt a source/pseudo-target label-induced hypothesis space that can reduce the search space to further tighten this bound. To measure the dissimilarity between hypotheses, we define a novel cross-margin discrepancy to alleviate instability during adversarial learning. In addition, we present extensive empirical evidence showing that the proposed method boosts the performance of image classification accuracy on standard domain adaptation benchmarks.

URL: https://openreview.net/forum?id=kiPsMct7vL

---

Title: LEAD: Min-Max Optimization from a Physical Perspective

Authors: Reyhane Askari Hemmat, Amartya Mitra, Guillaume Lajoie, Ioannis Mitliagkas

Abstract: Adversarial formulations such as generative adversarial networks (GANs) have rekindled interest in two-player min-max games. A central obstacle in the optimization of such games is the rotational dynamics that hinder their convergence. In this paper, we show that game optimization shares dynamic properties with particle systems subject to multiple forces, and one can leverage tools from physics to improve optimization dynamics. Inspired by the physical framework, we propose LEAD, an optimizer for min-max games. Next, using Lyapunov stability theory and spectral analysis, we study LEAD’s convergence properties in continuous and discrete time settings for a class of quadratic min-max games to demonstrate linear convergence to the Nash equilibrium. Finally, we empirically evaluate our method on synthetic setups and CIFAR-10 image generation to demonstrate improvements in GAN training.

URL: https://openreview.net/forum?id=vXSsTYs6ZB
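
The rotational dynamics mentioned in the abstract can be seen on a toy problem. The following is a minimal sketch, not the LEAD optimizer itself: it runs plain simultaneous gradient descent-ascent on the bilinear game f(x, y) = x * y, which is an assumed example chosen for illustration. On this game each step multiplies the distance to the equilibrium (0, 0) by sqrt(1 + lr^2), so the iterates rotate and spiral outward instead of converging.

```python
import math

def gda_step(x, y, lr):
    """One simultaneous gradient descent-ascent step on f(x, y) = x * y."""
    grad_x = y  # df/dx: the min player descends along this
    grad_y = x  # df/dy: the max player ascends along this
    return x - lr * grad_x, y + lr * grad_y

def run_gda(x0, y0, lr, steps):
    """Return the distance to the equilibrium (0, 0) before and after each step."""
    x, y = x0, y0
    distances = [math.hypot(x, y)]
    for _ in range(steps):
        x, y = gda_step(x, y, lr)
        distances.append(math.hypot(x, y))
    return distances

# Hypothetical starting point and step size, chosen only for the demo.
distances = run_gda(1.0, 1.0, lr=0.1, steps=50)
```

The growth factor follows from (x - lr*y)^2 + (y + lr*x)^2 = (1 + lr^2)(x^2 + y^2).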

---

Title: On Averaging ROC Curves

Authors: Jack Hogan, Niall M. Adams

Abstract: Receiver operating characteristic (ROC) curves are a popular method of summarising the performance of classifiers. The ROC curve describes the separability of the distributions of predictions from a two-class classifier. There are a variety of situations in which an analyst seeks to aggregate multiple ROC curves into a single representative example. A number of methods of doing so are available; however, there is a degree of subtlety that is often overlooked when selecting the appropriate one. An important component of this relates to the interpretation of the decision process for which the classifier will be used. This paper summarises a number of methods of aggregation and carefully delineates the interpretations of each in order to inform their correct usage. A toy example is provided that highlights how an injudicious choice of aggregation method can lead to erroneous conclusions.

URL: https://openreview.net/forum?id=FByH3qL87G
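
As an illustration of what aggregating ROC curves can mean in practice, here is a minimal sketch of vertical averaging: the true positive rate (TPR) of each curve is interpolated at a shared grid of false positive rates (FPR) and then averaged. The abstract does not list which aggregation methods the paper covers, so the choice of vertical averaging, the function names, and the toy curves are all assumptions made for this example.

```python
def interp(x, xs, ys):
    """Linear interpolation on a curve whose xs are strictly increasing."""
    for i in range(len(xs) - 1):
        x0, x1 = xs[i], xs[i + 1]
        if x0 <= x <= x1:
            return ys[i] + (ys[i + 1] - ys[i]) * (x - x0) / (x1 - x0)
    raise ValueError("x lies outside the curve")

def vertical_average(curves, fpr_grid):
    """Average the TPR of several ROC curves at each FPR in fpr_grid."""
    averaged = []
    for fpr in fpr_grid:
        tprs = [interp(fpr, curve_fpr, curve_tpr) for curve_fpr, curve_tpr in curves]
        averaged.append(sum(tprs) / len(tprs))
    return averaged

# Two toy curves given as (fpr, tpr) pairs of lists; curve_b is the chance diagonal.
curve_a = ([0.0, 0.5, 1.0], [0.0, 1.0, 1.0])
curve_b = ([0.0, 0.5, 1.0], [0.0, 0.5, 1.0])
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
mean_tpr = vertical_average([curve_a, curve_b], grid)
```

Averaging at fixed FPR is only one option; averaging at fixed decision thresholds, for example, generally yields a different curve, which is the kind of distinction the abstract warns about.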

---

Title: Undersampling is a Minimax Optimal Robustness Intervention in Nonparametric Classification

Authors: Niladri S. Chatterji, Saminul Haque, Tatsunori Hashimoto

Abstract: While a broad range of techniques have been proposed to tackle distribution shift, the simple baseline of training on an undersampled balanced dataset often achieves close to state-of-the-art accuracy across several popular benchmarks. This is rather surprising, since undersampling algorithms discard excess majority group data. To understand this phenomenon, we ask if learning is fundamentally constrained by a lack of minority group samples. We prove that this is indeed the case in the setting of nonparametric binary classification. Our results show that in the worst case, an algorithm cannot outperform undersampling unless there is a high degree of overlap between the train and test distributions (which is unlikely to be the case in real-world datasets), or if the algorithm leverages additional structure about the distribution shift. In particular, in the case of label shift we show that there is always an undersampling algorithm that is minimax optimal. In the case of group-covariate shift we show that there is an undersampling algorithm that is minimax optimal when the overlap between the group distributions is small. We also perform an experimental case study on a label shift dataset and find that in line with our theory, the test accuracy of robust neural network classifiers is constrained by the number of minority samples.

URL: https://openreview.net/forum?id=r6oHDYOZ6p
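
The undersampling baseline described in the abstract can be sketched as follows. This is a generic balanced-undersampling routine under assumed names (`undersample_balanced`, the toy data, and the seed are hypothetical), not the specific algorithm analysed in the paper: every class is randomly downsampled to the size of the smallest class, so excess majority data is discarded.

```python
import random
from collections import defaultdict

def undersample_balanced(examples, labels, seed=0):
    """Randomly downsample every class to the size of the smallest class."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for example, label in zip(examples, labels):
        by_label[label].append(example)
    n_min = min(len(items) for items in by_label.values())
    balanced = []
    for label, items in by_label.items():
        # Sample without replacement so no example is duplicated.
        for example in rng.sample(items, n_min):
            balanced.append((example, label))
    rng.shuffle(balanced)
    return balanced

# Toy imbalanced dataset: 90 majority examples, 10 minority examples.
examples = list(range(100))
labels = [0] * 90 + [1] * 10
balanced = undersample_balanced(examples, labels)
```

The size of the resulting dataset is fixed by the minority class, which matches the abstract's point that the number of minority samples is the binding constraint.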

---

Title: On the Convergence and Calibration of Deep Learning with Differential Privacy

Authors: Zhiqi Bu, Hua Wang, Zongyu Dai, Qi Long

Abstract: Differentially private (DP) training preserves data privacy, usually at the cost of slower convergence (and thus lower accuracy), as well as more severe mis-calibration than its non-private counterpart. To analyze the convergence of DP training, we formulate a continuous time analysis through the lens of neural tangent kernel (NTK), which characterizes the per-sample gradient clipping and the noise addition in DP training, for arbitrary network architectures and loss functions. Interestingly, we show that the noise addition only affects the privacy risk but not the convergence or calibration, whereas the per-sample gradient clipping (under both flat and layerwise clipping styles) only affects the convergence and calibration.

Furthermore, we observe that while DP models trained with a small clipping norm usually achieve the best accuracy, they are poorly calibrated and thus unreliable. In sharp contrast, DP models trained with a large clipping norm enjoy the same privacy guarantee and similar accuracy, but are significantly better calibrated. Our code can be found at https://github.com/woodyx218/opacus_global_clipping.

URL: https://openreview.net/forum?id=K0CAGgjYS1
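
The two operations the abstract separates, per-sample gradient clipping and noise addition, can be sketched in a few lines. This is the standard flat-clipping DP-SGD gradient written with plain Python lists; it is an assumed illustration, not the paper's code or the API of the linked repository, and the function names and parameters are hypothetical.

```python
import math
import random

def clip(grad, clip_norm):
    """Rescale one per-sample gradient so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def dp_sgd_gradient(per_sample_grads, clip_norm, noise_multiplier, rng):
    """Clip each per-sample gradient, sum, add Gaussian noise, then average."""
    clipped = [clip(g, clip_norm) for g in per_sample_grads]
    dim = len(per_sample_grads[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    # Noise standard deviation is proportional to the clipping norm.
    sigma = noise_multiplier * clip_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(per_sample_grads) for v in noisy]

# Toy batch of two per-sample gradients; noise disabled to expose the clipping.
grads = [[3.0, 4.0], [0.3, 0.4]]
no_noise = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=0.0,
                           rng=random.Random(0))
```

With `clip_norm=1.0`, the first gradient (norm 5) is shrunk to norm 1 while the second (norm 0.5) passes through unchanged, which shows how a small clipping norm changes the update direction independently of the noise.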

---

Title: A Reproducible and Realistic Evaluation of Partial Domain Adaptation Methods

Authors: Tiago Salvador, Kilian FATRAS, Ioannis Mitliagkas, Adam M Oberman

Abstract: Unsupervised Domain Adaptation (UDA) aims at classifying unlabeled target images leveraging source labeled ones. In the case of an extreme label shift scenario between the source and target domains, where we have extra source classes not present in the target domain, the UDA problem becomes a harder problem called Partial Domain Adaptation (PDA). While different methods have been developed to solve the PDA problem, most successful algorithms use model selection strategies that rely on target labels to find the best hyper-parameters and/or models along training. These strategies violate the main assumption in PDA: only unlabeled target domain samples are available. In addition, there are also experimental inconsistencies between developed methods - different architectures, hyper-parameter tuning, number of runs - yielding unfair comparisons. The main goal of this work is to provide a realistic evaluation of PDA methods under different model selection strategies and a consistent evaluation protocol. We evaluate 6 state-of-the-art PDA algorithms on 2 different real-world datasets using 7 different model selection strategies. Our two main findings are: (i) without target labels for model selection, the accuracy of the methods decreases by up to 30 percentage points; (ii) only one method and model selection pair performs well on both datasets. Experiments were performed with our PyTorch framework, BenchmarkPDA, which we open source.

URL: https://openreview.net/forum?id=XcVzIBXeRn

---

New submissions
===============

Title: Revisiting Topic-Guided Language Models

Abstract: A recent line of work in natural language processing has aimed to combine language models and topic models. These topic-guided language models augment neural language models with topic models, unsupervised learning methods that can discover document-level patterns of word use. This paper compares the effectiveness of these methods in a standardized setting. We study four topic-guided language models and two baselines, evaluating the held-out predictive performance of each model on three corpora. Surprisingly, we find that none of these methods outperform a standard LSTM language model baseline, and most fail to learn good topics. Further, we train a probe of the neural language model that shows that the baseline's hidden states already encode topic information. We make public all code used for this study.

URL: https://openreview.net/forum?id=lXBEwFfxpA

---

Title: Homomorphic Self-Supervised Learning

Abstract: Many state-of-the-art self-supervised learning approaches fundamentally rely on transformations applied to the input in order to selectively extract task-relevant information. Recently, the field of equivariant deep learning has developed to introduce structure into the feature space of deep neural networks by designing them as homomorphisms with respect to input transformations. In this work, we observe that many existing self-supervised learning algorithms can be both unified and generalized when seen through the lens of equivariant representations. Specifically, we introduce a general framework we call Homomorphic Self-Supervised Learning, and theoretically show how it may subsume the use of input-augmentations provided an augmentation-homomorphic feature extractor. We validate this theory experimentally for simple augmentations, demonstrate the necessity of representational structure for feature-space SSL, and further empirically explore how the parameters of this framework relate to those of traditional augmentation-based self-supervised learning. We conclude with a discussion of the potential benefits afforded by this new perspective on self-supervised learning.

URL: https://openreview.net/forum?id=tEKqQgbwbf

---

Title: Towards Stability of Autoregressive Neural Operators

Abstract: Neural operators have proven to be a promising approach for modeling spatiotemporal systems in the physical sciences. However, training these models for large systems can be quite challenging as they incur significant computational and memory expense; these systems are often forced to rely on autoregressive time-stepping of the neural network to predict future temporal states. While this is effective in managing costs, it can lead to uncontrolled error growth over time and eventual instability. We analyze the sources of this autoregressive error growth using prototypical neural operator models for physical systems and explore ways to mitigate it. We introduce architectural and application-specific improvements that allow for careful control of instability-inducing operations within these models without inflating the compute/memory expense. We present results on several scientific systems that include Navier-Stokes fluid flow, rotating shallow water, and a high-resolution global weather forecasting system. We demonstrate that applying our design principles to prototypical neural networks leads to significantly lower errors in long-range forecasts with 800% longer forecasts without qualitative signs of divergence compared to the original models for these systems. We open-source our anonymized code (https://anonymous.4open.science/r/stabilizing_neural_operators-5774/) for reproducibility.

URL: https://openreview.net/forum?id=RFfUUtKYOG

---

Title: Learning domain-specific causal discovery from time series

Abstract: Causal discovery (CD) from time-varying data is important in neuroscience, medicine, and machine learning. Techniques for CD encompass randomized experiments, which are generally unbiased but expensive, and algorithms such as Granger causality, conditional-independence-based, structural-equation-based, and score-based methods that are only accurate under strong assumptions made by human designers. However, as demonstrated in other areas of machine learning, human expertise is often not entirely accurate and tends to be outperformed in domains with abundant data. In this study, we examine whether we can enhance domain-specific causal discovery for time series using a data-driven approach. Our findings indicate that this procedure significantly outperforms human-designed, domain-agnostic causal discovery methods, such as Mutual Information, VAR-LiNGAM, and Granger Causality on the MOS 6502 microprocessor, the NetSim fMRI dataset, and the Dream3 gene dataset. We argue that, when feasible, the causality field should consider a supervised approach in which domain-specific CD procedures are learned from extensive datasets with known causal relationships, rather than being designed by human specialists. Our findings promise a new approach toward improving CD in neural and medical data and for the broader machine learning community.

URL: https://openreview.net/forum?id=JFaZ94tT8M

---

Title: The inverse problem for kernel means

Abstract: We discuss the inverse problem for the kernel embedding of measures. We identify which elements of a reproducing kernel Hilbert space are in the cone generated by some set of kernel functions as the polar dual of the Herglotz-type functions, the functions with positive imaginary part. Over certain spaces, such as Sobolev spaces, the duality to Herglotz functions reduces to a classical multivariate moment problem, and, over analytic spaces, we see more complex analytic type conditions. We give conditions for when Herglotz functions have representations in terms of kernel functions in terms of reflexive reproducing kernel Hilbert spaces. We identify the orbits of a dynamical system in terms of the Koopmanism philosophy: we give a way to decide when there is an orbit contained in some compact subset of the domain.

URL: https://openreview.net/forum?id=u1mnESqKks

---

Title: CIGNN: A Causal Perspective for Semi-supervised Open-world Graph Classification

Abstract: Graph classification has gained growing attention in the graph machine learning community and a variety of semi-supervised methods have been developed to reduce the high cost of annotation. They usually combine graph neural networks (GNNs) and extensive semi-supervised techniques such as knowledge distillation. However, they adhere to the closed-set assumption that unlabeled graphs all belong to known classes, limiting their applications in the real world. This paper goes further, investigating a practical problem of semi-supervised open-world graph classification where these unlabeled graph data could come from unseen classes. A novel approach named Causality-Informed GNN (CIGNN) is proposed, which takes a causal look to detect components containing the most information related to the label space and classify unlabeled graphs into a known class or an unseen class. In particular, CIGNN contains a relational detector and a feature extractor to produce effective causal features, which maximize the mutual information with label information and exhibit sufficient disentanglement with non-causal elements. Furthermore, we construct a graph-of-graph based on geometrical relationships, which gives instructions on enhancing causal representations. By virtue of effective causal representations, we can provide accurate and balanced predictions for unlabeled graphs. An extension is also made to accomplish effective open-set graph classification. We verify our proposed methods on four benchmark datasets in various settings and experimental results reveal the effectiveness of our proposed CIGNN compared with state-of-the-art methods.

URL: https://openreview.net/forum?id=qcCE4mC2jI

---

Title: Expectation of the maximum of Normal random variables with applications to reinforcement learning

Abstract: We explore the application of expressions for the expected maxima of Normal random variables to compute the mean of the distribution of fixed points of the Bellman optimality equation for large state space Markov decision processes (MDPs), under a Bayesian framework. Current approaches to computing the statistics of the value functions in reinforcement learning rely on bounds and estimates that do not exploit statistical properties of the Bellman equation which can arise in the large system limit. Specifically, we utilise a recently developed mean field theory called dynamic mean field programming, which in principle allows us to compute exactly the prior and posterior moments of the value functions, under certain conditions on the MDP structure and beliefs. Computing the solution to the mean field equations however relies on computing expected maxima, and current approaches are limited to identically distributed rewards. We apply expressions for expected maxima, in general settings, to compute the mean field solutions of the Bellman equation at the start of learning. We analyse the resulting approximations to the mean field equations, establishing Lyapunov stability and contractive properties.

URL: https://openreview.net/forum?id=ZvM6TVPBBM
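
To make "expected maxima of Normal random variables" concrete, here is a minimal sketch of the classical closed form for two jointly Normal variables: E[max(X1, X2)] = mu1 * Phi(a) + mu2 * Phi(-a) + theta * phi(a), with theta = sqrt(sigma1^2 + sigma2^2 - 2 * rho * sigma1 * sigma2) and a = (mu1 - mu2) / theta. The abstract does not say which expressions the paper uses, so this two-variable case and the function names are assumptions for illustration only.

```python
import math

def normal_pdf(z):
    """Standard Normal density phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    """Standard Normal distribution function Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_max_two_normals(mu1, sigma1, mu2, sigma2, rho=0.0):
    """E[max(X1, X2)] for jointly Normal X1, X2 with correlation rho."""
    theta_sq = sigma1 ** 2 + sigma2 ** 2 - 2.0 * rho * sigma1 * sigma2
    if theta_sq <= 0.0:
        # X1 - X2 is deterministic, so the larger mean always wins.
        return max(mu1, mu2)
    theta = math.sqrt(theta_sq)
    alpha = (mu1 - mu2) / theta
    return (mu1 * normal_cdf(alpha)
            + mu2 * normal_cdf(-alpha)
            + theta * normal_pdf(alpha))

# Two independent standard Normals: the closed form reduces to 1 / sqrt(pi).
iid_case = expected_max_two_normals(0.0, 1.0, 0.0, 1.0)
```

Unlike bounds derived for identically distributed variables, this expression handles different means and variances directly, which is the kind of generality the abstract refers to.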

---

Title: Zero-shot Node Classification with Graph Contrastive Embedding Network

Abstract: This paper studies zero-shot node classification, which aims to predict new classes (i.e., unseen classes) of nodes in a graph. This problem is challenging yet promising in a variety of real-world applications such as social analysis and bioinformatics. The key to zero-shot node classification is to enable the knowledge transfer of nodes from training classes to unseen classes. However, existing methods typically ignore the dependencies between nodes and classes, and fail to integrate the two in a unified way. In this paper, we present a novel framework called the Graph Contrastive Embedding Network (GraphCEN) for zero-shot node classification. Specifically, GraphCEN first constructs an affinity graph to model the relations between the classes. Then node-level and class-level contrastive learning (CL) are proposed to jointly learn node embeddings and class assignments in an end-to-end manner. The two-level CL can be optimized to mutually enhance each other. Extensive experiments indicate that our GraphCEN significantly outperforms the state-of-the-art approaches on three challenging benchmark datasets.

URL: https://openreview.net/forum?id=8wGXnjRLSy

---

Title: Intrinsic Motivation via Surprise Memory

Abstract: We present a new computing model for intrinsic rewards in reinforcement learning that addresses the limitations of existing surprise-driven explorations. The reward is the novelty of the surprise rather than the surprise norm. We estimate the surprise novelty as retrieval errors of a memory network wherein the memory stores and reconstructs surprises. Our surprise memory (SM) augments the capability of surprise-based intrinsic motivators, maintaining the agent's interest in exciting exploration while reducing unwanted attraction to unpredictable or noisy observations. Our experiments demonstrate that the SM combined with various surprise predictors exhibits efficient exploring behaviors and significantly boosts the final performance in sparse reward environments, including Noisy-TV, navigation and challenging Atari games.

URL: https://openreview.net/forum?id=qC487kbXlZ

---

Title: Robust Learning Rate Selection for Stochastic Optimization via Splitting Diagnostic

Abstract: This paper proposes SplitSGD, a new dynamic learning rate schedule for stochastic optimization. This method decreases the learning rate for better adaptation to the local geometry of the objective function whenever a stationary phase is detected, that is, when the iterates are likely to bounce around in the vicinity of a local minimum. The detection is performed by splitting the single thread into two and using the inner product of the gradients from the two threads as a measure of stationarity. Owing to this simple yet provably valid stationarity detection, SplitSGD is easy to implement and incurs essentially no additional computational cost over standard SGD. Through a series of extensive experiments, we show that this method is appropriate for both convex problems and training (non-convex) neural networks, with performance that compares favorably to other stochastic optimization methods. Importantly, this method is observed to be very robust with a set of default parameters for a wide range of problems and, moreover, can yield better generalization performance than other adaptive gradient methods such as Adam.

URL: https://openreview.net/forum?id=3PbxuMNQkp
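
The splitting idea in the abstract can be sketched on a toy problem. This is a simplified illustration and not the authors' SplitSGD: two SGD threads are started from the same iterate, their mean gradients are compared through an inner product, and a non-positive value is read as a sign of stationarity. The toy objective, the window length, the decision threshold, and the decay factor are all assumptions.

```python
import random

def noisy_grad(theta, rng, noise=1.0):
    """Stochastic gradient of the toy objective 0.5 * ||theta||^2."""
    return [t + rng.gauss(0.0, noise) for t in theta]

def run_thread(theta, lr, steps, rng):
    """Run plain SGD; return the final iterate and the mean gradient seen."""
    theta = list(theta)
    mean_grad = [0.0] * len(theta)
    for _ in range(steps):
        g = noisy_grad(theta, rng)
        theta = [t - lr * gi for t, gi in zip(theta, g)]
        mean_grad = [m + gi / steps for m, gi in zip(mean_grad, g)]
    return theta, mean_grad

def splitting_diagnostic(theta, lr, steps, rng):
    """Split into two threads from the same start and compare mean gradients."""
    end_a, grad_a = run_thread(theta, lr, steps, rng)
    end_b, grad_b = run_thread(theta, lr, steps, rng)
    inner = sum(a * b for a, b in zip(grad_a, grad_b))
    merged = [(a + b) / 2 for a, b in zip(end_a, end_b)]
    return inner, merged

def decay_if_stationary(lr, inner, decay=0.5):
    """Assumed rule: shrink the learning rate when the threads disagree."""
    return lr * decay if inner <= 0 else lr

# Far from the minimum both threads see nearly the same gradient direction.
inner, merged = splitting_diagnostic([100.0, 100.0], lr=0.01, steps=10,
                                     rng=random.Random(0))
```

Far from a minimum the true gradient dominates the noise and the inner product is clearly positive; near a minimum the two mean gradients are mostly noise, so their inner product is as likely to be negative as positive.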

---

Title: A Multilinear Least-Squares Formulation for Sparse Tensor Canonical Correlation Analysis

Abstract: Tensor data have recently become important in various applications, e.g., image and video recognition, which pose new challenges for data modeling and analysis approaches, such as high-order relations of large complexity, varying data scale and gross noise. In this paper, we consider the problem of sparse canonical correlation analysis for arbitrary tensor data. Although several methods have been proposed for this task, there are still limitations hindering its practical applications. To this end, we present a general Sparse Tensor Canonical Correlation Analysis (gSTCCA) method from a multilinear least-squares perspective. Specifically, we formulate the problem as a constrained multilinear least-squares problem with tensor-structured sparsity regularization based on CANDECOMP/PARAFAC (CP) decomposition. Then we present a divide-and-conquer deflation approach to tackle the problem by successive rank-one tensor estimation of the residual tensors, where the overall model is broken up into a set of unconstrained linear least-squares problems which can be efficiently solved. Through extensive experiments conducted on five different datasets for recognition tasks, we demonstrate that the proposed method achieves promising performance compared to the SOTA vector- and tensor-based canonical correlation analysis methods in terms of classification accuracy, model sparsity and robustness to missing and noisy data. The code is publicly available at https://anonymous.4open.science/r/gSTCCA-151C.

URL: https://openreview.net/forum?id=zc0Y0cAuTV

---

Title: ContraSim – Analyzing Neural Representations Based on Contrastive Learning

Abstract: Recent work has compared neural network representations via similarity-based analyses to improve model interpretation. The quality of a similarity measure is typically evaluated by its success in assigning a high score to representations that are expected to be matched. However, existing similarity measures perform mediocrely on standard benchmarks. In this work, we develop a new similarity measure, dubbed ContraSim, based on contrastive learning. In contrast to common closed-form similarity measures, ContraSim learns a parameterized measure by using both similar and dissimilar examples. We perform an extensive experimental evaluation of our method, with both language and vision models, on the standard layer prediction benchmark and two new benchmarks that we introduce: the multilingual benchmark and the image–caption benchmark. In all cases, ContraSim achieves much higher accuracy than previous similarity measures, even when presented with challenging examples. Finally, ContraSim is more suitable for the analysis of neural networks, revealing new insights not captured by previous measures.

URL: https://openreview.net/forum?id=ILzCLWKLbk

---

Title: Online model selection by learning how compositional kernels evolve

Abstract: Motivated by the need for efficient, personalized learning in health, we investigate the problem of online compositional kernel selection for multi-task Gaussian Process regression. Existing composition selection methods do not satisfy our strict criteria in health; selection must occur quickly, and the selected kernels must maintain the appropriate level of complexity, sparsity, and stability as data arrives online. We introduce the Kernel Evolution Model (KEM), a generative process on how to evolve kernel compositions in a way that manages the bias-variance trade-off as we observe more data about a user. Using pilot data, we learn a set of kernel evolutions that can be used to quickly select kernels for new test users. KEM reliably selects high-performing kernels for a range of synthetic and real data sets, including two health data sets.

URL: https://openreview.net/forum?id=23WZFQBUh5

---

Title: Modular Deep Learning

Abstract: Transfer learning has recently become the dominant paradigm of machine learning. Pre-trained models fine-tuned for downstream tasks achieve better performance with fewer labelled examples. Nonetheless, it remains unclear how to develop models that specialise towards multiple tasks without incurring negative interference and that generalise systematically to non-identically distributed tasks. Modular deep learning has emerged as a promising solution to these challenges. In this framework, units of computation are often implemented as autonomous parameter-efficient modules. Information is conditionally routed to a subset of modules and subsequently aggregated. These properties enable positive transfer and systematic generalisation by separating computation from routing and updating modules locally. We offer a survey of modular architectures, providing a unified view over several threads of research that evolved independently in the scientific literature. Moreover, we explore various additional purposes of modularity, including scaling language models, causal inference and discovery, programme simulation, and hierarchical reinforcement learning. Finally, we report various concrete applications where modularity has been successfully deployed such as cross-lingual and cross-modal knowledge transfer.

URL: https://openreview.net/forum?id=z9EkXfvxta

---
