Weekly TMLR digest for Apr 02, 2023


TMLR

Apr 1, 2023, 8:00:08 PM
to tmlr-annou...@googlegroups.com


New certifications
==================

Survey Certification: Partition-Based Active Learning for Graph Neural Networks

Jiaqi Ma, Ziqiao Ma, Joyce Chai, Qiaozhu Mei

https://openreview.net/forum?id=e0xaRylNuT

---


Accepted papers
===============


Title: ChemSpacE: Interpretable and Interactive Chemical Space Exploration

Authors: Yuanqi Du, Xian Liu, Nilay Mahesh Shah, Shengchao Liu, Jieyu Zhang, Bolei Zhou

Abstract: Discovering meaningful molecules in the vast combinatorial chemical space has been a long-standing challenge in many fields, from materials science to drug design. Recent progress in machine learning, especially with generative models, shows great promise for automated molecule synthesis. Nevertheless, most molecule generative models remain black-boxes, whose utilities are limited by a lack of interpretability and human participation in the generation process. In this work, we propose \textbf{Chem}ical \textbf{Spac}e \textbf{E}xplorer (ChemSpacE), a simple yet effective method for exploring the chemical space with pre-trained deep generative models. Our method enables users to interact with existing generative models and steer the molecule generation process. We demonstrate the efficacy of ChemSpacE on the molecule optimization task and the latent molecule manipulation task in single-property and multi-property settings. On the molecule optimization task, the performance of ChemSpacE is on par with previous black-box optimization methods yet is considerably faster and more sample efficient. Furthermore, the interface from ChemSpacE facilitates human-in-the-loop chemical space exploration and interactive molecule design. Code and demo are available at \url{https://github.com/yuanqidu/ChemSpacE}.

URL: https://openreview.net/forum?id=C1Xl8dYCBn

---

Title: A Free Lunch with Influence Functions? An Empirical Evaluation of Influence Functions for Average Treatment Effect Estimation

Authors: Matthew James Vowels, Sina Akbari, Necati Cihan Camgoz, Richard Bowden

Abstract: The applications of causal inference may be life-critical, including the evaluation of vaccinations, medicine, and social policy. However, when undertaking estimation for causal inference, practitioners rarely have access to what might be called `ground-truth' in a supervised learning setting, meaning the chosen estimation methods cannot be evaluated and must be assumed to be reliable. It is therefore crucial that we have a good understanding of the performance consistency of typical methods available to practitioners. In this work we provide a comprehensive evaluation of recent semiparametric methods (including neural network approaches) for average treatment effect estimation. Such methods have been proposed as a means to derive unbiased causal effect estimates and statistically valid confidence intervals, even when using otherwise non-parametric, data-adaptive machine learning techniques. We also propose a new estimator `MultiNet', and a variation on the semiparametric update step `MultiStep', which we evaluate alongside existing approaches. The performance of both semiparametric and `regular' methods is found to be dataset-dependent, indicating an interaction between the methods used, the sample size, and the nature of the data-generating process. Our experiments highlight the need for practitioners to check the consistency of their findings, potentially by undertaking multiple analyses with different combinations of estimators.

URL: https://openreview.net/forum?id=dQxBRqCjLr
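The semiparametric, influence-function-based estimators evaluated in this paper build on the classic augmented inverse-propensity-weighted (AIPW) estimator of the ATE. A minimal numpy sketch of that textbook baseline (not the paper's MultiNet or MultiStep, whose details are not given in the abstract):

```python
import numpy as np

def aipw_ate(y, t, mu1, mu0, e):
    """Augmented inverse-propensity-weighted (AIPW) estimate of the
    average treatment effect, the influence-function-based estimator
    that semiparametric ATE methods build on.

    y   : observed outcomes, shape (n,)
    t   : binary treatment indicators, shape (n,)
    mu1 : outcome-model predictions under treatment, shape (n,)
    mu0 : outcome-model predictions under control, shape (n,)
    e   : estimated propensity scores P(T=1|X), shape (n,)
    """
    # Plug-in difference plus the influence-function correction terms.
    psi = (mu1 - mu0
           + t * (y - mu1) / e
           - (1 - t) * (y - mu0) / (1 - e))
    return psi.mean()
```

When the outcome models fit the observed data exactly, the correction terms vanish and the estimate reduces to the mean of `mu1 - mu0`.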

---

Title: Partition-Based Active Learning for Graph Neural Networks

Authors: Jiaqi Ma, Ziqiao Ma, Joyce Chai, Qiaozhu Mei

Abstract: We study the problem of semi-supervised learning with Graph Neural Networks (GNNs) in an active learning setup. We propose GraphPart, a novel partition-based active learning approach for GNNs. GraphPart first splits the graph into disjoint partitions and then selects representative nodes within each partition to query. The proposed method is motivated by a novel analysis of the classification error under realistic smoothness assumptions over the graph and the node features. Extensive experiments on multiple benchmark datasets demonstrate that the proposed method outperforms existing active learning methods for GNNs under a wide range of annotation budget constraints. In addition, the proposed method does not introduce additional hyperparameters, which is crucial for model training, especially in the active learning setting where a labeled validation set may not be available.

URL: https://openreview.net/forum?id=e0xaRylNuT
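The two-stage procedure described in the abstract (partition, then query representative nodes) can be sketched as below. The partitioning step itself is assumed given; selecting the node nearest each partition's feature centroid is our hypothetical stand-in for "representative", not necessarily the paper's exact criterion:

```python
import numpy as np

def select_representatives(features, partition):
    """Sketch of GraphPart's selection stage: within each disjoint
    partition, query the node closest to the partition's feature
    centroid. `partition` maps each node to a partition id and would
    come from a graph partitioner in the actual method."""
    reps = []
    for p in np.unique(partition):
        idx = np.where(partition == p)[0]
        centroid = features[idx].mean(axis=0)
        dists = np.linalg.norm(features[idx] - centroid, axis=1)
        reps.append(int(idx[dists.argmin()]))  # most central node
    return reps
```

The queried nodes (one per partition, or more under a larger budget) would then be labeled and used to train the GNN.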

---

Title: Clustering using Approximate Nearest Neighbour Oracles

Authors: Enayat Ullah, Harry Lang, Raman Arora, Vladimir Braverman

Abstract: We study the problem of clustering data points in a streaming setting when one has access to the geometry of the space only via approximate nearest neighbour (ANN) oracles. In this setting, we present algorithms for streaming $O(1)$-approximate $k$-median clustering and its (streaming) coreset construction. In certain domains of interest, such as spaces with constant expansion, our algorithms improve upon the best-known runtime of both these problems. Furthermore, our results extend to cost functions satisfying the approximate triangle inequality, which subsumes $k$-means clustering and $M$-estimators. Finally, we run experiments on the Census1990 dataset, wherein the results empirically support our theory.

URL: https://openreview.net/forum?id=TzRXyO3CzX

---

Title: Bayesian Optimization with Informative Covariance

Authors: Afonso Eduardo, Michael U. Gutmann

Abstract: Bayesian optimization is a methodology for global optimization of unknown and expensive objectives. It combines a surrogate Bayesian regression model with an acquisition function to decide where to evaluate the objective. Typical regression models are given by Gaussian processes with stationary covariance functions. However, these functions are unable to express prior input-dependent information, including possible locations of the optimum. The ubiquity of stationary models has led to the common practice of exploiting prior information via informative mean functions. In this paper, we highlight that these models can perform poorly, especially in high dimensions. We propose novel informative covariance functions for optimization, leveraging nonstationarity to encode preferences for certain regions of the search space and adaptively promote local exploration during optimization. We demonstrate that the proposed functions can increase the sample efficiency of Bayesian optimization in high dimensions, even under weak prior information.

URL: https://openreview.net/forum?id=JwgVBv18RG

---

Title: Turning Normalizing Flows into Monge Maps with Geodesic Gaussian Preserving Flows

Authors: Guillaume Morel, Lucas Drumetz, Simon Benaïchouche, Nicolas Courty, François Rousseau

Abstract: Normalizing Flows (NF) are powerful likelihood-based generative models that are able to trade off between expressivity and tractability to model complex densities. A now well-established research avenue leverages optimal transport (OT) and looks for Monge maps, i.e., models with minimal effort between the source and target distributions. This paper introduces a method based on Brenier's polar factorization theorem to transform any trained NF into a more OT-efficient version without changing the final density. We do so by learning a rearrangement of the source (Gaussian) distribution that minimizes the OT cost between the source and the final density. The Gaussian-preserving transformation is implemented with the construction of high-dimensional divergence-free functions, and the path leading to the estimated Monge map is further constrained to lie on a geodesic in the space of volume-preserving diffeomorphisms thanks to Euler's equations. The proposed method leads to smooth flows with reduced OT costs for several existing models without affecting the model performance.

URL: https://openreview.net/forum?id=2UQv8L1Cv9

---

Title: Graph Neural Networks Designed for Different Graph Types: A Survey

Authors: Josephine Thomas, Alice Moallemy-Oureh, Silvia Beddar-Wiesing, Clara Holzhüter

Abstract: Graphs are ubiquitous in nature and can therefore serve as models for many practical as well as theoretical problems. For this purpose, they can be defined as many different types which suitably reflect the individual contexts of the represented problem. To address cutting-edge problems based on graph data, the research field of Graph Neural Networks (GNNs) has emerged. Despite the field's youth and the speed at which new models are developed, many recent surveys have been published to keep track of them. Nevertheless, no survey has yet gathered which GNNs can process which kinds of graph types. In this survey, we give a detailed overview of existing GNNs and, unlike previous surveys, categorize them according to their ability to handle different graph types and properties. We consider GNNs operating on static and dynamic graphs of different structural constitutions, with or without node or edge attributes. Moreover, we distinguish between GNN models for discrete-time and continuous-time dynamic graphs and group the models according to their architecture. We find that there are still graph types that are not, or only rarely, covered by existing GNN models. We point out where models are missing and give potential reasons for their absence.

URL: https://openreview.net/forum?id=h4BYtZ79uy

---

Title: Generalization bounds for Kernel Canonical Correlation Analysis

Authors: Enayat Ullah, Raman Arora

Abstract: We study the problem of multiview representation learning using kernel canonical correlation analysis (KCCA) and establish non-asymptotic bounds on generalization error for regularized empirical risk minimization. In particular, we give fine-grained high-probability bounds on generalization error ranging from $O(n^{-1/6})$ to $O(n^{-1/5})$ depending on underlying distributional properties, where $n$ is the number of data samples. For the special case of finite-dimensional Hilbert spaces (such as linear CCA), our rates improve, ranging from $O(n^{-1/2})$ to $O(n^{-1})$. Finally, our results generalize to the problem of functional canonical correlation analysis over abstract Hilbert spaces.

URL: https://openreview.net/forum?id=KwWKB9Bqam
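For the finite-dimensional special case the abstract mentions (linear CCA), the top canonical correlation can be computed in closed form by whitening both views and taking singular values of the cross-covariance. A minimal numpy sketch of the regularized ERM being analyzed (the `reg` ridge term stands in for the paper's regularization; the kernelized version is analogous in feature space):

```python
import numpy as np

def linear_cca(X, Y, reg=1e-3):
    """Top canonical correlation between two views, via the standard
    whitened cross-covariance formulation with Tikhonov regularization
    `reg` (regularized empirical risk minimization)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    # Whiten both views; singular values of the whitened
    # cross-covariance are the canonical correlations.
    Wx = np.linalg.inv(np.linalg.cholesky(Cxx))
    Wy = np.linalg.inv(np.linalg.cholesky(Cyy))
    M = Wx @ Cxy @ Wy.T
    return np.linalg.svd(M, compute_uv=False)[0]
```

When one view is an exact linear function of the other, the top canonical correlation approaches 1 as `reg` shrinks.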

---

Title: Learning Identity-Preserving Transformations on Data Manifolds

Authors: Marissa Catherine Connor, Kion Fallah, Christopher John Rozell

Abstract: Many machine learning techniques incorporate identity-preserving transformations into their models to generalize their performance to previously unseen data. These transformations are typically selected from a set of functions that are known to maintain the identity of an input when applied (e.g., rotation, translation, flipping, and scaling). However, there are many natural variations that cannot be labeled for supervision or defined through examination of the data. As suggested by the manifold hypothesis, many of these natural variations live on or near a low-dimensional, nonlinear manifold. Several techniques represent manifold variations through a set of learned Lie group operators that define directions of motion on the manifold. However, these approaches are limited because they require transformation labels when training their models and they lack a method for determining which regions of the manifold are appropriate for applying each specific operator. We address these limitations by introducing a learning strategy that does not require transformation labels and developing a method that learns the local regions where each operator is likely to be used while preserving the identity of inputs. Experiments on MNIST and Fashion MNIST highlight our model's ability to learn identity-preserving transformations on multi-class datasets. Additionally, we train on CelebA to showcase our model's ability to learn semantically meaningful transformations on complex datasets in an unsupervised manner.

URL: https://openreview.net/forum?id=gyhiZYrk5y

---

Title: A Halfspace-Mass Depth-Based Method for Adversarial Attack Detection

Authors: Marine Picot, Federica Granese, Guillaume Staerman, Marco Romanelli, Francisco Messina, Pablo Piantanida, Pierre Colombo

Abstract: Despite the widespread use of deep learning algorithms, vulnerability to adversarial attacks is still an issue limiting their use in critical applications. Detecting these attacks is thus crucial to build reliable algorithms and has received increasing attention in the last few years.
In this paper, we introduce the HalfspAce Mass dePth dEtectoR (HAMPER), a new method to detect adversarial examples by leveraging the concept of data depths, a statistical notion that provides center-outward ordering of points with respect to (w.r.t.) a probability distribution. In particular, the halfspace-mass (HM) depth exhibits attractive properties such as computational efficiency, which makes it a natural candidate for adversarial attack detection in high-dimensional spaces. Additionally, HM is non-differentiable, making it harder for attackers to directly attack HAMPER via gradient-based methods. We evaluate HAMPER in the context of supervised adversarial attack detection across four benchmark datasets.
Overall, we empirically show that HAMPER consistently outperforms SOTA methods. In particular, the gains are 13.1% (29.0%) in terms of AUROC (resp. FPR) on SVHN, 14.6% (25.7%) on CIFAR10 and 22.6% (49.0%) on CIFAR100 compared to the best performing method.

URL: https://openreview.net/forum?id=YtU0nDb5e8
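The halfspace-mass depth underlying HAMPER can be approximated by Monte Carlo over random halfspaces: central points tend to sit in the heavier side of a random cut, outliers do not. The sampling scheme below (random Gaussian directions, a uniform split point within the projected data range) is an illustrative assumption, not the paper's exact construction:

```python
import numpy as np

def halfspace_mass_depth(x, data, n_dirs=500, seed=0):
    """Monte-Carlo sketch of halfspace-mass (HM) depth: average, over
    random hyperplanes, of the fraction of reference points lying on
    the same side as x. Higher values indicate more central points."""
    rng = np.random.default_rng(seed)
    depth = 0.0
    for _ in range(n_dirs):
        u = rng.normal(size=data.shape[1])     # random direction
        proj = data @ u
        s = rng.uniform(proj.min(), proj.max())  # random split point
        # mass of the halfspace that contains x
        side = (proj >= s) if (x @ u >= s) else (proj < s)
        depth += side.mean()
    return depth / n_dirs
```

A detector would threshold a (class-conditional) depth score of a sample's features; deep points look clean, shallow points look adversarial.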

---

Title: Reusable Options through Gradient-based Meta Learning

Authors: David Kuric, Herke van Hoof

Abstract: Hierarchical methods in reinforcement learning have the potential to reduce the number of decisions that the agent needs to perform when learning new tasks. However, finding reusable and useful temporal abstractions that facilitate fast learning remains a challenging problem. Recently, several deep learning approaches were proposed to learn such temporal abstractions in the form of options in an end-to-end manner. In this work, we point out several shortcomings of these methods and discuss their potential negative consequences. Subsequently, we formulate the desiderata for reusable options and use these to frame the problem of learning options as a gradient-based meta-learning problem. This allows us to formulate an objective that explicitly incentivizes options which allow a higher-level decision maker to adjust in few steps to different tasks. Experimentally, we show that our method is able to learn transferable components which accelerate learning, and that it performs better than existing methods developed for this setting. Additionally, we perform ablations to quantify the impact of using gradient-based meta-learning as well as other proposed changes.

URL: https://openreview.net/forum?id=qdDmxzGuzu

---

Title: Containing a spread through sequential learning: to exploit or to explore?

Authors: Xingran Chen, Hesam Nikpey, Jungyeol Kim, Saswati Sarkar, Shirin Saeedi Bidokhti

Abstract: The spread of an undesirable contact process, such as an infectious disease (e.g., COVID-19), is contained through testing and isolation of infected nodes. The temporal and spatial evolution of the process (along with containment through isolation) renders such detection fundamentally different from active search detection strategies. In this work, through an active learning approach, we design testing and isolation strategies to contain the spread and minimize the cumulative infections under a given test budget. We prove that the objective can be optimized, with performance guarantees, by greedily selecting the nodes to test. We further design reward-based methodologies that effectively minimize an upper bound on the cumulative infections and are computationally more tractable in large networks. These policies, however, need knowledge about the nodes' infection probabilities which are dynamically changing and have to be learned by sequential testing. We develop a message-passing framework for this purpose and, building on that, show novel tradeoffs between exploitation of knowledge through reward-based heuristics and exploration of the unknown through a carefully designed probabilistic testing. The tradeoffs are fundamentally distinct from the classical counterparts under active search or multi-armed bandit problems (MABs). We provably show the necessity of exploration in a stylized network and show through simulations that exploration can outperform exploitation in various synthetic and real-data networks depending on the parameters of the network and the spread.

URL: https://openreview.net/forum?id=qvRWcDXBam

---

Title: Bidirectional View based Consistency Regularization for Semi-Supervised Domain Adaptation

Authors: Yuntao Du, 娟 江, Hongtao Luo, Haiyang Yang, MingCai Chen, Chongjun Wang

Abstract: Distinguished from unsupervised domain adaptation (UDA), semi-supervised domain adaptation (SSDA) can additionally access a few labeled target samples during learning. Although remarkable progress has been achieved, target supervised information is easily overwhelmed by massive source supervised information, as there are many more labeled source samples than target ones. In this work, we propose a novel method, BVCR, that better utilizes the supervised information through three schemes, i.e., modeling, exploration, and interaction. In the modeling scheme, BVCR models the source supervision and target supervision separately, to avoid the target supervised information being overwhelmed by the source supervised information and to better utilize the target supervision. Besides, as the two kinds of supervised information naturally offer distinct views of the target domain, the exploration scheme performs intra-domain consistency regularization to better explore target information with bidirectional views. Moreover, as both views are complementary to each other, the interaction scheme introduces inter-domain consistency regularization to activate information interaction bidirectionally. Thus, the proposed method is elegantly symmetrical by design and easy to implement. Extensive experiments are conducted, and the results show the effectiveness of the proposed method.

URL: https://openreview.net/forum?id=WVwnccBJLz

---


New submissions
===============


Title: ReMIX: Regret Minimization for Monotonic Value Function Factorization in Multi-Agent Reinforcement Learning

Abstract: Value function factorization methods have become a dominant approach for cooperative multi-agent reinforcement learning under a centralized training and decentralized execution paradigm. By factorizing the optimal joint action-value function using a monotonic mixing function of agents' utilities, these algorithms ensure consistency between joint and local action selections for decentralized decision-making. Nevertheless, the use of monotonic mixing functions also induces representational limitations. Finding the optimal projection of an unrestricted mixing function onto monotonic function classes is still an open problem. To this end, we propose ReMIX, which formulates this optimal projection problem for value function factorization as a regret minimization over the projection weights of different state-action values. Such an optimization problem can be relaxed and solved using the Lagrangian multiplier method to obtain the closed-form optimal projection weights. By minimizing the resulting policy regret, we can narrow the gap between the optimal and the restricted monotonic mixing functions, thus obtaining an improved monotonic value function factorization. Our experimental results on Predator-Prey and StarCraft Multi-Agent Challenge environments demonstrate the effectiveness of our method, indicating its improved capability of handling environments with non-monotonic value functions.

URL: https://openreview.net/forum?id=mPoQWsrSkW
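The monotonicity constraint the abstract refers to can be made concrete with a QMIX-style mixing function whose weights are forced nonnegative, so the joint value is non-decreasing in every agent utility (the condition that makes decentralized argmax consistent). A minimal sketch (a linear mixer for illustration; the actual mixing networks are state-conditioned and nonlinear):

```python
import numpy as np

def monotonic_mix(utilities, w, b):
    """Monotonic mixing of agent utilities into a joint value:
    Q_joint = |w| . q + b. Taking abs(w) enforces nonnegative weights,
    so Q_joint is non-decreasing in each agent's utility."""
    return np.abs(w) @ utilities + b
```

ReMIX's contribution, per the abstract, is choosing how to weight state-action values when projecting an unrestricted mixing function onto this monotonic class, which the linear sketch does not capture.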

---

Title: A DNN Optimizer that Improves over AdaBelief by Suppression of the Adaptive Stepsize Range

Abstract: We make contributions towards improving adaptive-optimizer performance. Our improvements are based on suppression of the range of adaptive stepsizes in the AdaBelief optimizer. Firstly, we show that the particular placement of the parameter $\epsilon$ within the update expressions of AdaBelief reduces the range of the adaptive stepsizes, making AdaBelief closer to SGD with momentum. Secondly, we extend AdaBelief by further suppressing the range of the adaptive stepsizes. To achieve the above goal, we perform mutual layerwise vector projections between the gradient $\boldsymbol{g}_t$ and its first momentum $\boldsymbol{m}_t$ before using them to estimate the second momentum. The new optimization method is referred to as \emph{Aida}. Thirdly, extensive experimental results show that Aida outperforms nine optimizers when training transformers and LSTMs for NLP, and VGG and ResNet for image classification over CIFAR10 and CIFAR100, while matching the best performance of the nine methods when training WGAN-GP models for image generation tasks. Furthermore, Aida produces higher validation accuracies than AdaBelief for training ResNet18 over ImageNet.

URL: https://openreview.net/forum?id=VI2JjIfU37
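The "mutual layerwise vector projections" between $\boldsymbol{g}_t$ and $\boldsymbol{m}_t$ can be sketched as below. This is our reading of the abstract (projecting each vector onto the direction of the other before forming the second-momentum estimate); the paper's exact recursion may differ:

```python
import numpy as np

def mutual_project(g, m, eps=1e-12):
    """Sketch of a mutual projection between a layer's gradient g and
    its first momentum m: keep only the component of each along the
    other, shrinking the spread of the resulting adaptive stepsizes."""
    g_p = (g @ m) / (m @ m + eps) * m  # component of g along m
    m_p = (m @ g) / (g @ g + eps) * g  # component of m along g
    return g_p, m_p
```

When g and m are orthogonal both projections vanish, and when they are aligned the vectors pass through unchanged, which is the range-suppression intuition the abstract describes.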

---

Title: Online Min-max Problems: Nonconvexity and Saddle Point

Abstract: Online min-max optimization has recently gained considerable interest due to its rich applications to game theory, multi-agent reinforcement learning, online robust learning, etc. Theoretical understanding in this field has been mainly focused on convex-concave settings. Online min-max optimization with nonconvex geometries, which captures various online deep learning problems, has not yet been studied. In this paper, we make a first effort and investigate online nonconvex-strongly-concave min-max optimization in a nonstationary environment. We first introduce a natural notion of local Nash equilibrium (NE)-regret, and then propose a novel algorithm coined SODA to achieve the optimal regret. We further generalize our study to the setting with stochastic first-order feedback, and show that a variation of SODA can also achieve the same optimal regret in expectation. Our theoretical results and the superior performance of the proposed method are further validated by empirical experiments. To our best knowledge, this is the first exploration of efficient online nonconvex min-max optimization.

URL: https://openreview.net/forum?id=TdzQtbLeVw

---

Title: Learning from time-dependent streaming data with online stochastic algorithms

Abstract: This paper addresses the problem of stochastic optimization in a streaming setting, where the objective function must be minimized using only time-dependent and biased estimates of its gradients. The study presents a non-asymptotic analysis of various Stochastic Gradient (SG) based methods, including the well-known SG descent, mini-batch SG, and time-varying mini-batch SG methods, as well as their iterated averages (Polyak-Ruppert averaging). The analysis establishes novel heuristics that link dependence, biases, and convexity levels, which allow for the acceleration of convergence of SG-based methods. Specifically, the heuristics demonstrate how SG-based methods can overcome long- and short-range dependencies and biases. In particular, the use of time-varying mini-batches counteracts dependency structures and biases while ensuring convexity, and combining this with Polyak-Ruppert averaging further accelerates convergence. These heuristics are particularly useful for learning problems with highly dependent data, noisy variables, and lacking convexity. Our results are validated through experiments using simulated and real-life data.

URL: https://openreview.net/forum?id=kdfiEu1ul6
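The combination the abstract reports as most effective, SG descent with a decaying stepsize plus Polyak-Ruppert iterate averaging, can be sketched as follows. The noisy-gradient callback is a stand-in for one draw from the time-dependent, possibly biased stream the paper analyzes:

```python
import numpy as np

def sgd_polyak(theta0, noisy_grad, n_steps=500, lr=0.5, seed=0):
    """SGD with stepsize lr/sqrt(t) plus Polyak-Ruppert averaging.
    `noisy_grad(theta, rng)` returns one stochastic gradient at theta."""
    rng = np.random.default_rng(seed)
    theta, avg = float(theta0), float(theta0)
    for t in range(1, n_steps + 1):
        theta -= lr / np.sqrt(t) * noisy_grad(theta, rng)
        avg += (theta - avg) / t  # running average of the iterates
    return theta, avg

# Example: minimize f(theta) = theta^2 / 2 from gradients corrupted
# by unit Gaussian noise; the averaged iterate smooths out the noise.
last, averaged = sgd_polyak(5.0, lambda th, rng: th + rng.normal())
```

Time-varying mini-batches, per the abstract, would additionally average several such draws per step to counteract dependence and bias.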

---

Title: UGAE: A Novel Approach to Non-exponential Discounting

Abstract: The discounting mechanism in Reinforcement Learning determines the relative importance of future and present rewards.
While exponential discounting is widely used in practice, non-exponential discounting methods that align with human behavior are often desirable for creating human-like agents.
However, non-exponential discounting methods cannot be directly applied in modern on-policy actor-critic algorithms like PPO.
To address this issue, we propose Universal Generalized Advantage Estimation (UGAE), which allows for the computation of GAE advantages with arbitrary discounting.
Additionally, we introduce Beta-weighted discounting, a continuous interpolation between exponential and hyperbolic discounting, to increase flexibility in choosing a discounting method.
To showcase the utility of UGAE, we provide an analysis of the properties of various discounting methods.
We also show experimentally that agents with non-exponential discounting trained via UGAE outperform variants trained with Monte Carlo advantage estimation.
Through analysis of various discounting methods and experiments, we demonstrate the superior performance of UGAE with Beta-weighted discounting over the Monte Carlo baseline on standard RL benchmarks. UGAE is simple and easily integrated into any advantage-based algorithm as a replacement for the standard recursive GAE.

URL: https://openreview.net/forum?id=wZ6pJGRJA4
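The quantity whose GAE-style estimation UGAE enables is the return under an arbitrary discount sequence. A Monte Carlo sketch (the paper's estimator additionally mixes in value-function bootstraps, which are omitted here):

```python
import numpy as np

def discounted_returns(rewards, discount):
    """Per-timestep returns G_t = sum_k discount[k] * r_{t+k} for an
    arbitrary discount sequence with discount[0] == 1. Exponential
    discounting is the special case discount[k] = gamma**k; hyperbolic
    discounting is e.g. discount[k] = 1 / (1 + k)."""
    r = np.asarray(rewards, dtype=float)
    T = len(r)
    return np.array([sum(discount[k] * r[t + k] for k in range(T - t))
                     for t in range(T)])
```

With an exponential sequence this reproduces the usual recursive returns, which is the consistency check below.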

---

Title: TransFool: An Adversarial Attack against Neural Machine Translation Models

Abstract: Deep neural networks have been shown to be vulnerable to small perturbations of their inputs, known as adversarial attacks. In this paper, we investigate the vulnerability of Neural Machine Translation (NMT) models to adversarial attacks and propose a new attack algorithm called TransFool. To fool NMT models, TransFool builds on a multi-term optimization problem and a gradient projection step. By integrating the embedding representation of a language model, we generate fluent adversarial examples in the source language that maintain a high level of semantic similarity with the clean samples. Experimental results demonstrate that, for different translation tasks and NMT architectures, our white-box attack can severely degrade the translation quality while the semantic similarity between the original and the adversarial sentences stays high. Moreover, we show that TransFool is transferable to unknown target models. Finally, TransFool leads to improvement in terms of success rate, semantic similarity, and fluency compared to the existing attacks both in white-box and black-box settings. Thus, TransFool permits us to better characterize the vulnerability of NMT models and outlines the necessity to design strong defense mechanisms and more robust NMT systems for real-life applications.

URL: https://openreview.net/forum?id=sFk3aBNb81

---

Title: Model Averaging for Manifold Learning

Abstract: Manifold learning aims to extract information from high-dimensional data and provide low-dimensional representations while preserving nonlinear structures of the input data. Numerous manifold learning algorithms have been proposed in the literature. Yet, we lack a canonical quality metric to compare different manifold learning outcomes. We propose a new quality metric that is tuning-free and scale-invariant by utilizing the Mahalanobis distance. Using the proposed quality metric, we develop a model averaging procedure to combine different manifold learning algorithms. The quality metric can also be used for tuning parameter selection. We show for a few synthetic and real data examples that the model averaging outcome always performs similarly to the candidate algorithm that yields the best visualization or classification accuracy.

URL: https://openreview.net/forum?id=U8PMpygECy

---

Title: Limitations of the NTK for Understanding Generalization in Deep Learning

Abstract: The ``Neural Tangent Kernel'' (NTK) (Jacot et al., 2018) and its empirical variants have been proposed as proxies to capture certain behaviors of real neural networks. In this work, we study NTKs through the lens of scaling laws, and demonstrate that they fall short of explaining important aspects of neural network generalization. In particular, we demonstrate realistic settings where finite-width neural networks have significantly better data scaling exponents as compared to their corresponding empirical and infinite NTKs at initialization. This reveals a more fundamental difference between the real networks and NTKs, beyond just a few percentage points of test accuracy. Further, we show that even if the empirical NTK is allowed to be pre-trained on a constant number of samples, the kernel scaling does not catch up to the neural network scaling. Finally, we show that the empirical NTK continues to evolve throughout most of the training, in contrast with prior work which suggests that it stabilizes after a few epochs of training. Altogether, our work establishes concrete limitations of the NTK approach in understanding generalization of real networks on natural datasets.

URL: https://openreview.net/forum?id=Y3saBb7mCE
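The empirical NTK compared against real networks here is the Gram matrix of parameter gradients, $K_{ij} = \langle \partial f(x_i)/\partial\theta,\ \partial f(x_j)/\partial\theta\rangle$. A tiny finite-difference sketch (real evaluations use autodiff; this illustrates the definition only):

```python
import numpy as np

def empirical_ntk(f, params, X, eps=1e-5):
    """Empirical NTK at the given parameters:
    K[i, j] = <df(x_i)/dtheta, df(x_j)/dtheta>, with gradients taken
    by central finite differences. f(params, x) returns a scalar and
    params is a flat vector."""
    def grad(x):
        g = np.zeros_like(params)
        for i in range(len(params)):
            dp = np.zeros_like(params)
            dp[i] = eps
            g[i] = (f(params + dp, x) - f(params - dp, x)) / (2 * eps)
        return g
    J = np.array([grad(x) for x in X])  # one Jacobian row per input
    return J @ J.T
```

For a linear model the NTK is exactly the input Gram matrix, a useful sanity check.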

---

Title: Mind the Gap: Mitigating the Distribution Gap in Graph Few-shot Learning

Abstract: Prevailing supervised deep graph learning models often suffer from the issue of label scarcity, leading to performance degradation in the face of limited annotated data. Although numerous graph few-shot learning (GFL) methods have been developed to mitigate this problem, they tend to rely excessively on labeled data. This over-reliance on labeled data can result in impaired generalization ability in the test phase due to the existence of a distribution gap. Moreover, existing GFL methods lack a general purpose as their designs are coupled with task or data-specific characteristics. To address these shortcomings, we propose a novel Self-Distilled Graph Few-shot Learning framework (SDGFL) that is both general and effective. SDGFL leverages a self-distilled contrastive learning procedure to boost GFL. Specifically, our model first pre-trains a graph encoder with contrastive learning using unlabeled data. Later, the trained encoder is frozen as a teacher model to distill a student model with a contrastive loss. The distilled model is then fed to GFL. By learning data representation in a self-supervised manner, SDGFL effectively mitigates the distribution gap and enhances generalization ability. Furthermore, our proposed framework is task and data-independent, making it a versatile tool for general graph mining purposes. To evaluate the effectiveness of our proposed framework, we introduce an information-based measurement that quantifies its capability. Through comprehensive experiments, we demonstrate that SDGFL outperforms state-of-the-art baselines on various graph mining tasks across multiple datasets in the few-shot scenario. We also provide a quantitative measurement of SDGFL’s superior performance in comparison to existing methods.

URL: https://openreview.net/forum?id=LEVbhNrLEL

---

Title: Do We Really Achieve Fairness with Explicit Sensitive Attributes?

Abstract: Recent research on fairness has shown that merely removing sensitive attributes from model inputs is not enough to achieve demographic parity, as non-sensitive attributes can still reveal sensitive information to varying extents. For instance, a person's ``race'' can be deduced from their ``zipcode'' to some extent. While current methods directly utilize explicit sensitive attributes (e.g., ``race'') to debias model predictions (e.g., obtained by ``zipcode''), they often fail to uphold demographic parity. This is especially true for high-sensitive samples, whose non-sensitive attributes are more likely to leak sensitive information than low-sensitive samples. This challenge stems from the model treating each sample with a specific sensitive attribute, while the prediction only incorporates partial sensitive information, leading to potential biases. This observation highlights the need for demographic parity measurements that account for the degree of sensitive information leakage in individual samples, and differentiate between samples with varying degrees of leakage. To address this issue, we introduce a new definition of group fairness measurement called $\alpha$-Demographic Parity, which ensures demographic parity for samples with differing degrees of sensitive information leakage. To achieve $\alpha$-Demographic Parity, we propose to directly promote the independence of model predictions from the distribution of sensitive information, rather than the specific sensitive attributes. This approach directly minimizes the Hilbert-Schmidt Independence Criterion between the two distributions, thereby ensuring more precise and fair predictions across all subgroups. Our proposed method outperforms existing approaches in achieving $\alpha$-Demographic Parity and demonstrates strong performance in scenarios with limited sensitive attribute information, as evidenced by extensive experiments. 
Our code is anonymously available at https://anonymous.4open.science/r/TMLR_STFS_code-2ED6
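
As a rough illustration of the independence objective (not the authors' implementation), the Hilbert-Schmidt Independence Criterion between two batches of samples can be estimated with the standard biased estimator; the RBF kernel and bandwidth below are assumptions:

```python
import numpy as np

def rbf_kernel(X, sigma=1.0):
    # Gaussian (RBF) kernel matrix from pairwise squared distances.
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic(X, Y, sigma=1.0):
    # Biased empirical HSIC: trace(K H L H) / (n - 1)^2, where H centers
    # the kernel matrices. Values near zero indicate independence.
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    K, L = rbf_kernel(X, sigma), rbf_kernel(Y, sigma)
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
preds = rng.normal(size=(200, 1))
independent = hsic(preds, rng.normal(size=(200, 1)))            # near zero
dependent = hsic(preds, preds + 0.05 * rng.normal(size=(200, 1)))
```

In training, one batch would hold model predictions and the other a representation of the sensitive information, with the HSIC value added to the loss as a penalty.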

URL: https://openreview.net/forum?id=O7RlGx2fJI

---

Title: mL-BFGS: A Momentum-based L-BFGS for Distributed Large-scale Neural Network Optimization

Abstract: Quasi-Newton methods still face significant challenges in training large-scale neural networks due to the additional compute cost of Hessian-related computations and instability issues in stochastic training.
A well-known method, L-BFGS, which efficiently approximates the Hessian using the history of parameter and gradient changes, suffers from convergence instability in stochastic training.
So far, attempts that adapt L-BFGS to large-scale stochastic training incur considerable extra overhead, which offsets its convergence benefits in wall-clock time.
In this paper, we propose mL-BFGS, a lightweight momentum-based L-BFGS algorithm that paves the way for quasi-Newton (QN) methods in large-scale distributed deep neural network (DNN) optimization.
mL-BFGS introduces a nearly cost-free momentum scheme into the L-BFGS update that greatly reduces stochastic noise in the Hessian approximation, thereby stabilizing convergence during stochastic optimization.
For model training at a large scale, mL-BFGS approximates a block-wise Hessian, so that compute and memory costs can be distributed across all computing nodes.
We provide a supporting convergence analysis for mL-BFGS in stochastic settings.
To investigate mL-BFGS's potential in large-scale DNN training, we train benchmark neural models using mL-BFGS and compare performance with baselines (SGD, Adam, and other quasi-Newton methods).
Results show that mL-BFGS achieves noticeable speedups both iteration-wise and in wall-clock time.

URL: https://openreview.net/forum?id=9jnsPp8DP3

---

Title: HQ-VAE: Hierarchical Discrete Representation Learning with Variational Bayes

Abstract: Vector quantization (VQ) is a technique to deterministically learn features with discrete codebook representations. It is commonly achieved with a variational autoencoding model, VQ-VAE, which has been further extended to hierarchical structures for high-fidelity reconstruction. However, training hierarchical extensions of VQ-VAE is often unstable, in that the codebook is not used efficiently to express the data, which deteriorates reconstruction accuracy. To mitigate this problem, we propose a novel framework to stochastically learn hierarchical discrete representations on the basis of the variational Bayes framework, called hierarchically quantized variational autoencoder (HQ-VAE). HQ-VAE naturally unifies the hierarchical variants of VQ-VAE, such as VQ-VAE-2 and residual-quantized VAE (RQ-VAE), and stabilizes their training in a Bayesian scheme. Our comprehensive experiments on image datasets show that HQ-VAE enhances codebook usage and improves reconstruction performance. We also validate HQ-VAE on a different modality, an audio dataset, to confirm its broader applicability.
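
For context, the deterministic quantization step that VQ-VAE uses, and that HQ-VAE replaces with a stochastic, variational assignment, can be sketched as follows (toy shapes and values, not the paper's code):

```python
import numpy as np

def vector_quantize(z, codebook):
    # Deterministic VQ (as in VQ-VAE): map each latent vector to its
    # nearest codebook entry by squared Euclidean distance.
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d.argmin(axis=1)
    return codebook[idx], idx

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))                     # 8 codes, dimension 4
z = codebook[[2, 5]] + 0.01 * rng.normal(size=(2, 4))  # latents near codes 2, 5
zq, idx = vector_quantize(z, codebook)                 # idx recovers [2, 5]
```

Hierarchical variants such as VQ-VAE-2 and RQ-VAE stack several such quantizers (over multi-resolution latents or successive residuals); HQ-VAE unifies these layers under a single stochastic training objective.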

URL: https://openreview.net/forum?id=1rowoeUM5E

---

Title: An Empirical Evaluation of Federated Contextual Bandit Algorithms

Abstract: As the adoption of federated learning increases for learning from sensitive data local to user devices, it is natural to ask if the learning can be done using implicit signals generated as users interact with the applications of interest, rather than requiring access to explicit labels, which can be difficult to acquire in many tasks. We approach such problems with the framework of federated contextual bandits, and develop variants of prominent contextual bandit algorithms from the centralized setting for the federated setting. We carefully evaluate these algorithms in a range of scenarios simulated using publicly available datasets. Our simulations model typical setups encountered in the real world, such as various misalignments between an initial pre-trained model and the subsequent user interactions due to non-stationarity in the data and/or heterogeneity across clients. Our experiments reveal the surprising effectiveness of the simple and commonly used softmax heuristic in balancing the well-known exploration-exploitation tradeoff across the breadth of our settings.
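
The softmax heuristic the experiments highlight is simple to state; here is a single-agent sketch (the federated aggregation of per-client statistics is elided, and the temperature, prior, and reward model below are assumptions):

```python
import numpy as np

def softmax_policy(scores, temperature=0.1):
    # Softmax exploration: sample arms with probability proportional to
    # exp(score / temperature); lower temperature means greedier play.
    z = scores / temperature
    z = z - z.max()            # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.8])  # Bernoulli reward probability per arm
counts = np.ones(3)                      # one pseudo-observation per arm
wins = np.ones(3)
for _ in range(2000):
    p = softmax_policy(wins / counts)    # scores = empirical mean rewards
    arm = rng.choice(3, p=p)
    reward = float(rng.random() < true_means[arm])
    counts[arm] += 1
    wins[arm] += reward
best = int(np.argmax(wins / counts))     # should identify arm 2
```

In the federated setting, the per-arm statistics would be aggregated across clients by a server rather than held by a single agent.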

URL: https://openreview.net/forum?id=9zCCYOHkyM

---

Title: Continuous Deep Equilibrium Models: Training Neural ODEs Faster by Integrating Them to Infinity

Abstract: Implicit models separate the definition of a layer from the description of its solution process. While implicit layers allow features such as depth to adapt to new scenarios and inputs automatically, this adaptivity makes their computational expense challenging to predict. In this manuscript, **we increase the ``implicitness'' of the DEQ by redefining the method in terms of an infinite-time neural ODE**, which paradoxically decreases the training cost over a standard neural ODE by $2$--$4\times$. Additionally, we address the question: **is there a way to simultaneously achieve the robustness of implicit layers while allowing the reduced computational expense of an explicit layer?** To solve this, we develop Skip and Skip Reg. DEQ, an implicit-explicit (IMEX) layer that simultaneously trains an explicit prediction followed by an implicit correction. We show that training this explicit predictor is free and even decreases the training time by $1.11$--$3.19\times$. Together, this manuscript shows how bridging the dichotomy of implicit and explicit deep learning can combine the advantages of both techniques.
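
As a minimal sketch of the implicit-layer view (illustrative weights and solver, not the paper's implementation): a DEQ layer outputs the fixed point z* = f(z*, x), which is also the steady state of the ODE dz/dt = f(z, x) - z integrated to infinite time:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
W = 0.5 * W / np.linalg.norm(W, 2)   # spectral norm 0.5 -> f contracts in z
U = rng.normal(size=(4, 4))

def f(z, x):
    # One implicit layer; its output is defined as the fixed point z* = f(z*, x).
    return np.tanh(W @ z + U @ x)

def solve_deq(x, tol=1e-10, max_iter=500):
    # "Infinite-time" view: dz/dt = f(z, x) - z reaches steady state exactly
    # at the fixed point; here we solve it by plain fixed-point iteration.
    z = np.zeros(4)
    for _ in range(max_iter):
        z_next = f(z, x)
        if np.linalg.norm(z_next - z) < tol:
            break
        z = z_next
    return z

x = rng.normal(size=4)
z_star = solve_deq(x)
residual = np.linalg.norm(z_star - f(z_star, x))   # near zero at equilibrium
```

The paper's Skip DEQ additionally trains an explicit network to predict a good initial z, replacing the zero initialization above and cutting the number of solver iterations needed.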

URL: https://openreview.net/forum?id=mlI9f7u6Zo

---

Title: Empirical Study on Optimizer Selection for Out-of-Distribution Generalization

Abstract: Modern deep learning systems do not generalize well when the test data distribution is slightly different from the training data distribution. While much promising work has been done to address this fragility, a systematic study of the role of optimizers in out-of-distribution generalization performance has not been undertaken. In this study, we examine the performance of popular first-order optimizers for different classes of distributional shift under empirical risk minimization and invariant risk minimization. We address this question for image and text classification using DomainBed, WILDS, and Backgrounds Challenge as testbeds for studying different types of shift---namely correlation and diversity shift. We search over a wide range of hyperparameters and examine classification accuracy (in-distribution and out-of-distribution) for over 20,000 models. We arrive at the following findings, which we expect to be helpful for practitioners: i) adaptive optimizers (e.g., Adam) perform worse than non-adaptive optimizers (e.g., SGD, momentum SGD) on out-of-distribution performance. In particular, even though there is no significant difference in in-distribution performance, we show a measurable difference in out-of-distribution performance. ii) in-distribution performance and out-of-distribution performance exhibit three types of behavior depending on the dataset---linear returns, increasing returns, and diminishing returns. For example, when training on natural language data with Adam, further improving in-distribution performance does not significantly contribute to out-of-distribution generalization performance.

URL: https://openreview.net/forum?id=ipe0IMglFF

---
