Weekly TMLR digest for Mar 26, 2023

TMLR

Mar 25, 2023, 8:00:14 PM
to tmlr-annou...@googlegroups.com


New certifications
==================

Featured Certification: Identification of Negative Transfers in Multitask Learning Using Surrogate Models

Dongyue Li, Huy Nguyen, Hongyang Ryan Zhang

https://openreview.net/forum?id=KgfFAI9f3E

---


Accepted papers
===============


Title: Improving Generalization with Approximate Factored Value Functions

Authors: Shagun Sodhani, Sergey Levine, Amy Zhang

Abstract: Reinforcement learning in general unstructured MDPs presents a challenging learning problem. However, certain MDP structures, such as factorization, are known to simplify the learning problem. This fact is often not useful in complex tasks with high-dimensional state spaces, which do not usually exhibit such structure, and even if the structure is present, it is typically unknown. In this work, we turn this observation on its head: instead of developing algorithms for structured MDPs, we propose a representation learning algorithm that approximates an unstructured MDP with one that has factorized structure. We then use these factors as a more convenient representation of the state for downstream learning. The particular structure that we leverage is reward factorization, which defines a more compact class of MDPs that admit factorized value functions. We empirically verify the effectiveness of our approach in terms of faster training (better sample complexity) and robust zero-shot transfer (better generalization) on the ProcGen benchmark and the MiniGrid environments.
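
As a rough illustration of the reward-factorization idea (a sketch based on the abstract, not the authors' implementation; all names here are hypothetical): if the reward decomposes into per-factor terms, the value function can be estimated as a sum of per-factor heads.

    # Sketch: V(s) = sum_k V_k(phi_k(s)), one learned encoder/head per factor.
    import torch
    import torch.nn as nn

    class FactoredValue(nn.Module):
        def __init__(self, state_dim, num_factors, factor_dim):
            super().__init__()
            self.encoders = nn.ModuleList(
                [nn.Linear(state_dim, factor_dim) for _ in range(num_factors)])
            self.heads = nn.ModuleList(
                [nn.Linear(factor_dim, 1) for _ in range(num_factors)])

        def forward(self, state):
            # Each factor embedding contributes one additive value term.
            return sum(h(torch.relu(e(state)))
                       for e, h in zip(self.encoders, self.heads))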

URL: https://openreview.net/forum?id=LwEWrrKyja

---

Title: FLUID: A Unified Evaluation Framework for Flexible Sequential Data

Authors: Matthew Wallingford, Aditya Kusupati, Keivan Alizadeh-Vahid, Aaron Walsman, Aniruddha Kembhavi, Ali Farhadi

Abstract: Modern machine learning methods excel when training data is IID, large-scale, and well labeled. Learning in less ideal conditions remains an open challenge. The sub-fields of few-shot, continual, transfer, and representation learning have made substantial strides in learning under adverse conditions, each affording distinct advantages through methods and insights. These methods address different challenges, such as data arriving sequentially or scarce training examples; however, the difficult conditions an ML system will face over its lifetime often cannot be anticipated prior to deployment. Therefore, general ML systems which can handle the many challenges of learning in practical settings are needed. To foster research towards the goal of general ML methods, we introduce a new unified evaluation framework – FLUID (Flexible Sequential Data). FLUID integrates the objectives of few-shot, continual, transfer, and representation learning while enabling comparison and integration of techniques across these subfields. In FLUID, a learner faces a stream of data and must make sequential predictions while choosing how to update itself, adapt quickly to novel classes, and deal with changing data distributions, all while accounting for the total amount of compute. We conduct experiments on a broad set of methods that shed new light on the advantages and limitations of current techniques and indicate new research problems to solve. As a starting point towards more general methods, we present two new baselines which outperform other evaluated methods on FLUID.
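
A minimal sketch of the sequential evaluation protocol described above (hypothetical learner interface; FLUID's actual API may differ):

    # The learner predicts on each stream element before seeing its label,
    # and decides for itself when to spend compute on updates.
    def evaluate(learner, stream, compute_budget):
        correct, total, compute_used = 0, 0, 0
        for x, y in stream:
            correct += int(learner.predict(x) == y)
            total += 1
            if learner.wants_update() and compute_used < compute_budget:
                compute_used += learner.update(x, y)  # returns compute spent
        return correct / total, compute_used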

URL: https://openreview.net/forum?id=UvJBKWaSSH

---

Title: The Low-Rank Simplicity Bias in Deep Networks

Authors: Minyoung Huh, Hossein Mobahi, Richard Zhang, Brian Cheung, Pulkit Agrawal, Phillip Isola

Abstract: Modern deep neural networks are highly over-parameterized compared to the data on which they are trained, yet they often generalize remarkably well. A flurry of recent work has asked: why do deep networks not overfit to their training data? In this work, we make a series of empirical observations that investigate and extend the hypothesis that deeper networks are inductively biased to find solutions with lower effective rank embeddings. We conjecture that this bias exists because the volume of functions that map to low effective rank embeddings increases with depth. We show empirically that our claim holds for finite-width linear and non-linear models on practical learning paradigms, and that on natural data these are often the solutions that generalize well. We then show that the simplicity bias exists at both initialization and after training and is resilient to hyper-parameters and learning methods. We further demonstrate how linear over-parameterization of deep non-linear models can be used to induce low-rank bias, improving generalization performance on CIFAR and ImageNet without changing the modeling capacity.
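
A minimal sketch of the linear over-parameterization trick mentioned at the end of the abstract (not the authors' code): a single linear layer is replaced by a composition of linear maps, which leaves the expressible function class unchanged but, per the paper's hypothesis, biases training toward low effective rank.

    import torch.nn as nn

    def overparameterize(linear, hidden):
        # W is re-parameterized as W2 @ W1 with no nonlinearity in between,
        # so modeling capacity is unchanged.
        return nn.Sequential(
            nn.Linear(linear.in_features, hidden, bias=False),
            nn.Linear(hidden, linear.out_features, bias=True),
        )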

URL: https://openreview.net/forum?id=bCiNWDmlY2

---

Title: Identification of Negative Transfers in Multitask Learning Using Surrogate Models

Authors: Dongyue Li, Huy Nguyen, Hongyang Ryan Zhang

Abstract: Multitask learning is widely used in practice to train a low-resource target task by augmenting it with multiple related source tasks. Yet, naively combining all the source tasks with a target task does not always improve the prediction performance for the target task due to negative transfers. Thus, a critical problem in multitask learning is identifying subsets of source tasks that would benefit the target task. This problem is computationally challenging since the number of subsets grows exponentially with the number of source tasks; efficient heuristics for subset selection do not always capture the relationship between task subsets and multitask learning performances. In this paper, we introduce an efficient procedure to address this problem via surrogate modeling. In surrogate modeling, we sample (random) subsets of source tasks and precompute their multitask learning performances; then, we approximate the precomputed performances with a linear regression model that can also be used to predict the multitask performance of unseen task subsets. We show theoretically and empirically that fitting this model requires sampling only linearly many subsets in the number of source tasks. The fitted model provides a relevance score between each source task and the target task; we use the relevance scores to perform subset selection for multitask learning by thresholding. Through extensive experiments, we show that our approach predicts negative transfers from multiple source tasks to target tasks much more accurately than existing task affinity measures. Additionally, we demonstrate that for five weak supervision datasets, our approach consistently improves upon existing optimization methods for multitask learning.
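
A minimal sketch of the surrogate-modeling procedure (based on the abstract; mtl_performance is a hypothetical user-supplied routine that trains on a task subset and returns target-task performance):

    import numpy as np

    def task_relevance_scores(num_tasks, num_samples, mtl_performance, seed=0):
        rng = np.random.default_rng(seed)
        # Indicator vectors of random source-task subsets.
        X = (rng.random((num_samples, num_tasks)) < 0.5).astype(float)
        y = np.array([mtl_performance(np.flatnonzero(row)) for row in X])
        # Linear surrogate: one coefficient per source task, interpreted
        # as that task's relevance score for the target task.
        scores, *_ = np.linalg.lstsq(X, y, rcond=None)
        return scores

    # Subset selection by thresholding, as in the abstract:
    # selected = np.flatnonzero(task_relevance_scores(...) > tau)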

URL: https://openreview.net/forum?id=KgfFAI9f3E

---

Title: Multi-objective Bayesian Optimization with Heuristic Objectives for Biomedical and Molecular Data Analysis Workflows

Authors: Alina Selega, Kieran R. Campbell

Abstract: Many practical applications require optimization of multiple, computationally expensive, and possibly competing objectives that are well-suited for multi-objective Bayesian optimization (MOBO). However, for many types of biomedical data, measures of data analysis workflow success are often heuristic and therefore it is not known a priori which objectives are useful. Thus, MOBO methods that return the full Pareto front may be suboptimal in these cases. Here we propose a novel MOBO method that adaptively updates the scalarization function using properties of the posterior of a multi-output Gaussian process surrogate function. This approach selects useful objectives based on a flexible set of desirable criteria, allowing the functional form of each objective to guide optimization. We demonstrate the qualitative behaviour of our method on toy data and perform proof-of-concept analyses of single-cell RNA sequencing and highly multiplexed imaging datasets for univariate input optimization.

URL: https://openreview.net/forum?id=QspAcsAyis

---

Title: Parameter Efficient Node Classification on Homophilic Graphs

Authors: Lucas Prieto, Jeroen Den Boef, Paul Groth, Joran Cornelisse

Abstract: Deep Learning on Graphs was recently made possible with the introduction of Graph Neural Networks (GNNs). GNNs use learnable diffusion processes to propagate information through the graph and improve performance on downstream tasks. However, learning this diffusion process can be expensive in terms of memory and computation. While a lot of research has gone into making these models more expressive and able to capture more complex patterns, in practice, edges in common benchmarking datasets often encode similarity of nodes with respect to the downstream task. This property is called homophily. We argue that for these homophilic graphs, learnable diffusion processes and large receptive fields are not required to achieve competitive performance. We propose Graph Non-Parametric Diffusion (GNPD), a method that outperforms traditional GNNs using only two linear models and non-parametric diffusion. Our method takes ideas from Correct & Smooth (C&S) and the Scalable Inception Graph Network (SIGN) and combines them to create a simpler model that outperforms both of them on several datasets. Our method achieves unmatched parameter efficiency, competing with models with two orders of magnitude more parameters. Additionally, GNPD can also forego the spectral embeddings that are the computational bottleneck of the C&S method.
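
A minimal sketch of the non-parametric diffusion component (our reading of the abstract, not the released code): features are propagated with the fixed, symmetrically normalized adjacency, and only the two linear models around this step are trained.

    import numpy as np

    def diffuse(adj, features, steps):
        deg = adj.sum(axis=1)
        d_inv_sqrt = np.zeros_like(deg, dtype=float)
        d_inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5
        a_norm = d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
        out = features
        for _ in range(steps):
            out = a_norm @ out  # fixed propagation; no learnable parameters
        return out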

URL: https://openreview.net/forum?id=LIT8tjs6rJ

---


New submissions
===============


Title: Greedier is Better: Selecting Multiple Neighbors per Iteration for Sparse Subspace Clustering

Abstract: Sparse subspace clustering (SSC) using greedy-based neighbor selection, such as orthogonal matching pursuit (OMP), has been known as a popular computationally-efficient alternative to the standard $\ell_1$-minimization based methods. This paper proposes a new SSC scheme using generalized OMP (GOMP), a souped-up variant of OMP whereby multiple neighbors, say $p$ ($p\geq 1$), are identified in each iteration, along with a new stopping rule requiring nothing more than knowledge of the ambient signal dimension and the number $p$ of neighbors identified per iteration. Compared to conventional OMP (i.e., $p=1$), the proposed GOMP method involves fewer iterations, thereby enjoying lower algorithmic complexity; in addition, the proposed stopping rule is free from an off-line estimation of the subspace dimension or noise strength. Under the semi-random model, analytic performance guarantees, in terms of neighbor recovery rates, are established to justify the advantage of the proposed GOMP. Under mild conditions it is shown that, with high probability, (i) GOMP can retrieve more true neighbors than OMP, consequently yielding higher data clustering accuracy, and (ii) the proposed stopping rule halts the neighbor search once the number of recovered neighbors is close to the subspace dimension. Issues in selecting $p$ for practical implementation are also discussed. Computer simulations using both synthetic and real data are provided to demonstrate the effectiveness of the proposed approach and validate our analytic study.
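
A minimal sketch of GOMP-style neighbor selection for a single data point (the paper's stopping rule is omitted; a fixed iteration cap stands in for it):

    import numpy as np

    def gomp_neighbors(X, i, p, num_iter):
        # X: matrix whose unit-norm columns are data points; find neighbors
        # of column i among the remaining columns.
        y = X[:, i]
        support, residual = [], y.copy()
        for _ in range(num_iter):
            corr = np.abs(X.T @ residual)
            corr[i] = 0.0
            corr[support] = 0.0                    # exclude chosen points
            support.extend(np.argsort(corr)[-p:])  # p new neighbors per pass
            basis = X[:, support]
            coef, *_ = np.linalg.lstsq(basis, y, rcond=None)
            residual = y - basis @ coef            # project y off the support
        return support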

URL: https://openreview.net/forum?id=djD8IbSvgm

---

Title: Test-Time Adaptation for Visual Document Understanding

Abstract: For visual document understanding (VDU), self-supervised pretraining has been shown to successfully generate transferable representations; yet, effective adaptation of such representations to distribution shifts at test time remains an unexplored area. We propose DocTTA, a novel test-time adaptation method for documents that performs source-free domain adaptation using unlabeled target document data. DocTTA leverages cross-modality self-supervised learning via masked visual language modeling, as well as pseudo labeling, to adapt models learned on a \textit{source} domain to an unlabeled \textit{target} domain at test time. We introduce new benchmarks using existing public datasets for various VDU tasks, including entity recognition, key-value extraction, and document visual question answering. DocTTA shows significant improvements on these benchmarks over source-model performance, up to 1.89\% (F1 score), 3.43\% (F1 score), and 17.68\% (ANLS score), respectively.

URL: https://openreview.net/forum?id=zshemTAa6U

---

Title: Fair Kernel Regression through Cross-Covariance Operators

Abstract: Ensuring fairness in machine learning models is a difficult problem from both a formulation and implementation perspective. One sensible criterion for achieving fairness is Equalised Odds, which requires that subjects in protected and unprotected groups have equal true and false positive rates. However, practical implementation is challenging. This work proposes two ways to address this issue through the conditional independence operator. First, given the output values, it is used as a fairness measure of independence between model predictions and sensitive variables. Second, it is used as a regularisation term in the problem formulation, which seeks optimal models that balance performance and fairness concerning the sensitive variables. To illustrate the potential of our approach, we consider different scenarios. First, we use the Gaussian model to provide new insights into the problem formulation and numerical results on its convergence. Second, we present the formulation using the conditional cross-covariance operator. We anticipate that a closed-form solution is possible in the general problem formulation, including in a kernel formulation. Third, we introduce a normalised criterion of the conditional independence operator. All formulations are posed under the risk minimisation principle, which leads to theoretical results on the performance. Additionally, insights are provided into using these operators under a Gaussian Process setting. Our methods are compared to state-of-the-art methods in terms of performance and fairness metrics on a representative set of real problems. The results obtained with our proposed methodology show promising performance-fairness curves. Furthermore, we discuss the usefulness of linear weights in the fair model to describe the behaviour of the features when enforcing fairness over a particular set of input features.

URL: https://openreview.net/forum?id=MyQ1e1VQQ3

---

Title: INTEGRATE: Distance based Graph Convolutional Networks for Statistical Relational Learning

Abstract: Recently, several successful methods for learning embeddings of large knowledge bases have been developed. They have been motivated through the inevitability of learning and reasoning about various entities, their attributes and relations present in the knowledge bases. A potential limitation of much of this line of research is that the inherent semantic structure of the network is not exploited. To overcome this limitation, graph convolutional networks (GCNs) were proposed that generalized neural network models to multi-relational, graph-structured data sets. We consider the problem of learning distance-based Graph Convolutional Networks (GCNs) for multi-relational data within statistical relational learning. Specifically, we first embed the original graph into the Euclidean space $\mathbb{R}^m$ using a relational density estimation technique, thereby constructing a secondary Euclidean graph. The graph vertices correspond to the target triples and edges denote the Euclidean distances between the target triples. We emphasize the importance of learning the secondary Euclidean graph and the advantages of employing a distance matrix over the typically used adjacency matrix. Our comprehensive empirical evaluation demonstrates the superiority of our approach over 15 approaches spread over different GCN models, relational embedding techniques, rule learning techniques and relational models.
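
A minimal sketch of the secondary Euclidean graph construction (the relational density estimation step that produces the embeddings is assumed, not shown):

    import numpy as np

    def distance_matrix(embeddings):
        # embeddings: (num_triples, m) Euclidean embeddings of target triples;
        # the returned pairwise distances replace the usual adjacency matrix.
        diff = embeddings[:, None, :] - embeddings[None, :, :]
        return np.linalg.norm(diff, axis=-1)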

URL: https://openreview.net/forum?id=VJK8jLGHLR

---

Title: Inversion by Direct Iteration: An Alternative to Denoising Diffusion for Image Restoration

Abstract: Inversion by Direct Iteration (InDI) is a new formulation for supervised image restoration that avoids the so-called ``regression to the mean'' effect and produces more realistic and detailed images than existing regression-based methods. It does this by gradually improving image quality in small steps, similar to generative denoising diffusion models.

Image restoration is an ill-posed problem where multiple high-quality images are plausible reconstructions of a given low-quality input. The outcome of a single-step regression model is therefore typically an aggregate of all possible explanations, lacking detail and realism. The main advantage of InDI is that it does not try to predict the clean target image in a single step but instead gradually improves the image in small steps, resulting in better perceptual quality.

While generative denoising diffusion models also work in small steps, our formulation is distinct in that it does not require knowledge of any analytic form of the degradation process. Instead, we directly learn an iterative restoration process from low-quality and high-quality paired examples. InDI can be applied to virtually any image degradation, given paired training data. In conditional denoising diffusion image restoration the denoising network generates the restored image by repeatedly denoising an initial image of pure noise, conditioned on the degraded input. Contrary to conditional denoising formulations, InDI directly proceeds by iteratively restoring the input low-quality image, producing high-quality results on a variety of image restoration tasks, including motion and out-of-focus deblurring, super-resolution, compression artifact removal, and denoising.
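
A minimal sketch of the small-step inference loop (our reading of the interpolation formulation x_t = (1 - t) x + t y, not the authors' code): at each step the iterate is blended with the network's clean-image estimate.

    def restore(model, degraded, steps):
        # model(x, t) predicts the clean image from the iterate at time t.
        x, t, delta = degraded, 1.0, 1.0 / steps
        for _ in range(steps):
            x = (delta / t) * model(x, t) + (1.0 - delta / t) * x
            t -= delta
        return x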

URL: https://openreview.net/forum?id=VmyFF5lL3F

---

Title: Bayesian Importance of Features (BIF)

Abstract: We introduce a framework that provides quantitative explanations of statistical models through the probabilistic assessment of input feature importance. The core idea comes from utilizing the Dirichlet distribution to define the importance of input features and learning it via approximate Bayesian inference. The learned importance has a probabilistic interpretation and provides the relative significance of each input feature to a model's output, additionally assessing confidence about its importance quantification. As a consequence of using the Dirichlet distribution over the explanations, we can define a closed-form divergence to gauge the similarity between the importance learned under different models. We use this divergence to study the trade-offs between feature importance explainability and essential notions in modern machine learning, such as privacy and fairness. Furthermore, BIF can work on two levels: global explanation (feature importance across all data instances) and local explanation (individual feature importance for each data instance). We show the effectiveness of our method on a variety of synthetic and real datasets, taking into account both tabular and image datasets. The code can be found at \url{https://anonymous.4open.science/r/BIF-45EF/}.
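
The closed-form divergence mentioned above is presumably the standard KL divergence between two Dirichlet distributions, which can be computed directly:

    import numpy as np
    from scipy.special import digamma, gammaln

    def dirichlet_kl(alpha, beta):
        # KL( Dir(alpha) || Dir(beta) ) for two importance posteriors.
        a0, b0 = alpha.sum(), beta.sum()
        return (gammaln(a0) - gammaln(alpha).sum()
                - gammaln(b0) + gammaln(beta).sum()
                + ((alpha - beta) * (digamma(alpha) - digamma(a0))).sum())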

URL: https://openreview.net/forum?id=6lqgrBMrg3

---

Title: Finding Competence Regions in Domain Generalization

Abstract: We propose a "learning to reject" framework to address the problem of silent failures in Domain Generalization (DG), where the test distribution differs from the training distribution. Assuming a mild distribution shift, we wish to accept out-of-distribution (OOD) data whenever a model's estimated competence foresees trustworthy responses, instead of rejecting OOD data outright. Trustworthiness is then predicted via a proxy incompetence score that is tightly linked to the performance of a classifier. We present a comprehensive experimental evaluation of incompetence scores for classification and highlight the resulting trade-offs between rejection rate and accuracy gain. For comparability with prior work, we focus on standard DG benchmarks and consider the effect of measuring incompetence via different learned representations in a closed versus an open world setting. Our results suggest that increasing incompetence scores are indeed predictive of reduced accuracy, leading to significant improvements in average accuracy below a suitable incompetence threshold. However, the scores are not yet good enough to allow for a favorable accuracy/rejection trade-off in all tested domains. Surprisingly, our results also indicate that classifiers optimized for DG robustness do not outperform a naive Empirical Risk Minimization (ERM) baseline in the competence region, that is, where test samples elicit low incompetence scores.
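
A minimal sketch of the rejection trade-off measured in the paper (hypothetical array inputs): accept a test sample only when its incompetence score falls below a threshold, then record accuracy on the accepted subset.

    import numpy as np

    def accuracy_rejection_curve(scores, correct, thresholds):
        curve = []
        for tau in thresholds:
            accepted = scores < tau            # the "competence region"
            rejection_rate = 1.0 - accepted.mean()
            acc = correct[accepted].mean() if accepted.any() else float("nan")
            curve.append((rejection_rate, acc))
        return curve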

URL: https://openreview.net/forum?id=TSy0vuwQFN

---

Title: Generalizability of Adversarial Robustness Under Distribution Shifts

Abstract: Recent progress in empirical and certified robustness promises to deliver reliable and deployable Deep Neural Networks (DNNs). Despite that success, most existing evaluations of DNN robustness have been done on images sampled from the same distribution on which the model was trained. However, in the real world, DNNs may be deployed in dynamic environments that exhibit significant distribution shifts. In this work, we take a first step towards thoroughly investigating the interplay between empirical and certified adversarial robustness on the one hand and domain generalization on the other. To do so, we train robust models on multiple domains and evaluate their accuracy and robustness on an unseen domain. We observe that: (1) both empirical and certified robustness generalize to unseen domains, and (2) the level of generalizability does not correlate well with input visual similarity, measured by the FID between source and target domains. We also extend our study to cover a real-world medical application, in which adversarial augmentation significantly boosts the generalization of robustness with minimal effect on clean data accuracy.

URL: https://openreview.net/forum?id=XNFo3dQiCJ

---

Title: Using Confounded Data in Reinforcement Learning

Abstract: In the presence of confounding, naively using off-the-shelf offline reinforcement learning (RL) algorithms leads to sub-optimal behaviour. In this work, we propose a safe method to exploit confounded offline data in model-based RL, which improves the sample-efficiency of an interactive agent that also collects online, unconfounded data. First, we import ideas from the well-established framework of $do$-calculus to express model-based RL as a causal inference problem, thus bridging the gap between the fields of RL and causality. Then, we propose a generic method for learning a causal transition model from offline and online data, which captures and corrects the confounding effect using a hidden latent variable. We prove that our method is correct and efficient, in the sense that it attains better generalization guarantees thanks to the confounded offline data (in the asymptotic case), regardless of the confounding effect (the offline expert's behaviour). We showcase our method on a series of synthetic experiments, which demonstrate that a) using confounded offline data naively degrades the sample-efficiency of an RL agent; b) using confounded offline data correctly improves sample-efficiency.

URL: https://openreview.net/forum?id=nFWRuJXPkU

---

Title: Chasing Better Deep Image Priors between Over- and Under-parameterization

Abstract: Deep Neural Networks (DNNs) are well-known to act as \textbf{over-parameterized} deep image priors (DIP) that regularize various image inverse problems. Meanwhile, researchers also proposed extremely compact, \textbf{under-parameterized} image priors (e.g., deep decoder) that are strikingly competent for image restoration too, despite a loss of accuracy. These two extremes prompt us to ask whether there exists a better solution in the middle: \textit{between over- and under-parameterized image priors, can one identify ``intermediate" parameterized image priors that achieve better trade-offs between performance, efficiency, and even preserving strong transferability?} Drawing inspiration from the lottery ticket hypothesis (LTH), we conjecture and study a novel ``lottery image prior" (\textbf{LIP}) by exploiting DNN inherent sparsity, stated as: \textit{given an over-parameterized DNN-based image prior, it will contain a sparse subnetwork that can be trained in isolation, to match the original DNN's performance when being applied as a prior to various image inverse problems}. Our results validate the superiority of LIPs: we can successfully locate the LIP subnetworks from over-parameterized DIPs at substantial sparsity ranges. Those LIP subnetworks significantly outperform deep decoders under comparably compact model sizes (by often fully preserving the effectiveness of their over-parameterized counterparts), and they also possess high transferability across different images as well as restoration task types. Besides, we also extend LIP to compressive sensing image reconstruction, where a \textit{pre-trained} GAN generator is used as the prior (in contrast to \textit{untrained} DIP or deep decoder), and confirm its validity in this setting too. To the best of our knowledge, this is the first time that LTH has been demonstrated to be relevant in the context of inverse problems or image priors. Codes will be publicly available upon acceptance.

URL: https://openreview.net/forum?id=EwJJks2cSa

---

Title: Causally-guided Regularization of Graph Attention Improves Generalizability

Abstract: Graph attention networks estimate the relational importance of node neighbors to aggregate relevant information over local neighborhoods for a prediction task. However, the inferred attentions are vulnerable to spurious correlations and connectivity in the training data, hampering the generalizability of models. We introduce CAR, a general-purpose regularization framework for graph attention networks. Embodying a causal inference approach based on invariance prediction, CAR aligns the attention mechanism with the causal effects of active interventions on graph connectivity in a scalable manner. CAR is compatible with a variety of graph attention architectures, and we show that it systematically improves generalizability on various node classification tasks. Our ablation studies indicate that CAR hones in on the aspects of graph structure most pertinent to the prediction (e.g., homophily), and does so more effectively than alternative approaches. Finally, we also show that CAR enhances interpretability of attention coefficients by accentuating node-neighbor relations that point to causal hypotheses.

URL: https://openreview.net/forum?id=iDNMZgjJuJ

---

Title: Semantic Self-adaptation: Enhancing Generalization with a Single Sample

Abstract: The lack of out-of-domain generalization is a critical weakness of deep networks for semantic segmentation. Previous studies relied on the assumption of a static model, i.e., once the training process is complete, model parameters remain fixed at test time. In this work, we challenge this premise with a self-adaptive approach for semantic segmentation that adjusts the inference process to each input sample. Self-adaptation operates on two levels. First, it fine-tunes the parameters of convolutional layers to the input image using consistency regularization. Second, in Batch Normalization layers, it interpolates between the training and the reference distribution derived from a single test sample. Despite these techniques being well known in the literature, we surprisingly find their combination to set new state-of-the-art accuracy on synthetic-to-real generalization benchmarks. Our empirical study suggests that self-adaptation may complement the established practice of model regularization at training time for improving deep network generalization to out-of-domain data.
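
A minimal sketch of the second adaptation level (not the authors' code; alpha is a hypothetical mixing weight): Batch Normalization statistics are interpolated between the stored training statistics and those computed from the single test sample.

    import torch

    def adapt_bn_stats(mu_train, var_train, feats, alpha):
        # feats: activations of one test image, shape (C, H, W).
        mu_test = feats.mean(dim=(1, 2))
        var_test = feats.var(dim=(1, 2), unbiased=False)
        mu = alpha * mu_train + (1 - alpha) * mu_test
        var = alpha * var_train + (1 - alpha) * var_test
        return mu, var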

URL: https://openreview.net/forum?id=ILNqQhGbLx

---

Title: Revisiting Self-Distillation

Abstract: Knowledge distillation is the procedure of transferring ``knowledge'' from a large model (the teacher) to a more compact one (the student), often being used in the context of model compression. When both models have the same architecture, this procedure is called self-distillation. Several works have anecdotally shown that a self-distilled student can outperform the teacher on held-out data. In this work, we systematically study self-distillation in a number of settings. First, we show that even with a highly accurate teacher, self-distillation allows a student to surpass the teacher in all cases. Second, we revisit existing conceptual explanations of self-distillation and identify contradicting test cases, revealing possible drawbacks of these explanations. Third, we provide an alternative explanation for the dynamics of self-distillation through the lens of loss landscape geometry. We conduct extensive experiments to show that self-distillation leads to flatter minima, thereby resulting in better generalization. Finally, we study what properties self-distillation can transfer from teachers to students, beyond task accuracy. We show that a student can inherit natural robustness by leveraging the soft outputs of the teacher, whereas training merely on ground-truth labels makes the student less robust.
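
A minimal sketch of a self-distillation objective (the standard knowledge-distillation recipe with teacher and student sharing one architecture; not necessarily the paper's exact setup):

    import torch.nn.functional as F

    def self_distill_loss(student_logits, teacher_logits, labels, T=4.0, w=0.5):
        # Soft term: match the teacher's temperature-scaled output distribution.
        soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                        F.softmax(teacher_logits / T, dim=1),
                        reduction="batchmean") * T * T
        hard = F.cross_entropy(student_logits, labels)
        return w * soft + (1 - w) * hard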

URL: https://openreview.net/forum?id=HvzK6KXMcT

---

Title: Bandwidth Enables Generalization in Quantum Kernel Models

Abstract: Quantum computers are known to provide speedups over classical state-of-the-art machine learning methods in some specialized settings. For example, quantum kernel methods have been shown to provide an exponential speedup on a learning version of the discrete logarithm problem. Understanding the generalization of quantum models is essential to realizing similar speedups on problems of practical interest. Recent results demonstrate that generalization is hindered by the exponential size of the quantum feature space. Although these results suggest that quantum models cannot generalize when the number of qubits is large, in this paper we show that these results rely on overly restrictive assumptions. We consider a wider class of models by varying a hyperparameter that we call quantum kernel bandwidth. We analyze the large-qubit limit and provide explicit formulas for the generalization of a quantum model that can be solved in closed form. Specifically, we show that changing the value of the bandwidth can take a model from provably not being able to generalize to any target function to good generalization for well-aligned targets. Our analysis shows how the bandwidth controls the spectrum of the kernel integral operator and thereby the inductive bias of the model. We demonstrate empirically that our theory correctly predicts how varying the bandwidth affects generalization of quantum models on challenging datasets, including those far outside our theoretical assumptions. We discuss the implications of our results for quantum advantage in machine learning.
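
A minimal sketch of the bandwidth idea for one simple, classically simulable feature map (a product state with one RX(c * x_j) rotation per qubit; the paper covers a broader model class): the scaling c enters every rotation angle and thereby controls the kernel's spectrum.

    import numpy as np

    def product_state_kernel(x, y, c):
        # For this feature map, k(x, y) = prod_j cos^2(c * (x_j - y_j) / 2);
        # a small bandwidth c keeps the kernel from collapsing at many qubits.
        return np.prod(np.cos(c * (x - y) / 2.0) ** 2)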

URL: https://openreview.net/forum?id=A1N2qp4yAq

---

Title: Representation Balancing with Decomposed Patterns for Treatment Effect Estimation

Abstract: Estimating treatment effects from observational data is subject to a covariate shift problem incurred by selection bias. Recent research has sought to mitigate this problem by balancing the distribution of representations between the treated and controlled groups. The rationale behind this is that counterfactual estimation relies on (1) preserving the predictive power of factual outcomes and (2) learning balanced representations. However, there is a trade-off between achieving these two objectives. In this paper, we propose a novel model, DIGNet, which is designed to capture the patterns that contribute to outcome prediction (task 1) and representation balancing (task 2), respectively. Specifically, we derive a theoretical upper bound that links the concept of propensity confusion to representation balancing, and further transform the balancing patterns into Decompositions of Individual propensity confusion and Group distance minimization (PDIG) to capture more effective balancing patterns. Moreover, we suggest decomposing proxy features into Patterns of Pre-balancing and Balancing Representations (PPBR) to preserve patterns that are beneficial for outcome modeling. Extensive experiments confirm that PDIG and PPBR follow different pathways to achieve the same goal of improving treatment effect estimation. We hope our findings can serve as heuristics for investigating factors influencing the generalization of representation balancing models in counterfactual estimation.

URL: https://openreview.net/forum?id=uyp8eFbzzT

---

Title: Augmented Language Models: a Survey

Abstract: This survey reviews works in which language models (LMs) are augmented with reasoning skills and the ability to use tools. The former is defined as decomposing a potentially complex task into simpler subtasks, while the latter consists in calling external modules such as a code interpreter. LMs can leverage these augmentations separately or in combination via heuristics, or learn to do so from demonstrations. While adhering to a standard missing tokens prediction objective, such augmented LMs can use various, possibly non-parametric external modules to expand their context processing ability, thus departing from the pure language modeling paradigm. We therefore refer to them as Augmented Language Models (ALMs). The missing token objective allows ALMs to learn to reason, use tools, and even act, while still performing standard natural language tasks and even outperforming most regular LMs on several benchmarks. In this work, after reviewing current advances in ALMs, we conclude that this new research direction has the potential to address common limitations of traditional LMs such as interpretability, consistency, and scalability issues.

URL: https://openreview.net/forum?id=jh7wH2AzKK

---

Title: Unsupervised Domain Adaptation via Minimized Joint Error

Abstract: Unsupervised domain adaptation transfers knowledge from a learned source domain to a different target distribution, for which little or no labeled data is available. Some researchers have proposed upper bounds for the target error when transferring knowledge; e.g., Ben-David et al. (2010) established a theory based on minimizing the source error and the distance between marginal distributions simultaneously. However, in most works the joint error is usually ignored due to its intractability. In this paper, we argue that the joint error is essential for the domain adaptation problem, in particular if the samples from different classes in the source/target are closely aligned when matching the marginal distributions due to a large domain gap. To tackle this problem, we propose a novel objective that relates to an upper bound of the joint error. Moreover, we adopt a hypothesis space induced by source/pseudo-target labels that can reduce the search space to further tighten this bound. For the dissimilarity measurement between hypotheses, we propose a novel cross margin discrepancy to alleviate instability during adversarial learning. In addition, we present extensive empirical evidence that our proposal boosts image classification accuracy on standard domain adaptation benchmarks.

URL: https://openreview.net/forum?id=kiPsMct7vL

---

Title: JiangJun: Mastering Xiangqi by Tackling Non-Transitivity in Two-Player Zero-Sum Games

Abstract: In this paper we present an empirical study of non-transitivity in perfect-information games by studying Xiangqi, a traditional board game in China with similar game-tree complexity to chess and shogi. After analyzing over 10,000 human Xiangqi playing records, we demonstrate that the game’s strategic structure contains both transitive and non-transitive components. To address non-transitivity, we propose the JiangJun algorithm, which combines Monte-Carlo Tree Search (MCTS) with Policy Space Response Oracles (PSRO) to find an approximate Nash equilibrium. We evaluate the algorithm empirically using a WeChat mini program and achieve a Master level with a 99.39% win rate against human players. The algorithm’s effectiveness in overcoming non-transitivity is confirmed by relative population performance and visualization results.

URL: https://openreview.net/forum?id=MMsyqXIJuk

---

Title: Breaking the Spurious Causality of Conditional Generation via Fairness Intervention with Corrective Sampling

Abstract: Trying to capture the sample-label relationship, conditional generative models often end up inheriting the spurious correlation in the training dataset, giving label-conditional distributions that are severely imbalanced in another latent attribute. To mitigate such undesirable correlations engraved into generative models, which we call spurious causality, we propose a general two-step strategy. (a) Fairness Intervention (FI): emphasize the minority samples that are hard to generate due to the spurious correlation in the training dataset. (b) Corrective Sampling (CS): explicitly filter the generated samples to follow the desired label-conditional latent attribute distribution. We design the fairness intervention for various degrees of supervision on the spurious attribute, including unsupervised, weakly-supervised, and semi-supervised scenarios. Our experimental results show that the proposed FICS can successfully resolve the spurious correlation in generated samples on various datasets.
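
A minimal sketch of the corrective sampling step (b), with hypothetical generator and attribute-classifier interfaces (no retry cap shown):

    def corrective_sampling(generator, attr_classifier, label, target_counts):
        kept = []
        remaining = dict(target_counts)  # latent attribute -> samples needed
        while any(n > 0 for n in remaining.values()):
            x = generator.sample(label)
            a = attr_classifier(x)
            if remaining.get(a, 0) > 0:  # keep only under-represented attributes
                kept.append(x)
                remaining[a] -= 1
        return kept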

URL: https://openreview.net/forum?id=VV4zJwLwI7

---

Title: On the Robustness of Dataset Inference

Abstract: Machine learning (ML) models are costly to train as they can require a significant amount of data, computational resources and technical expertise. Thus, they constitute valuable intellectual property that needs protection from adversaries wanting to steal them. Ownership verification techniques allow the victims of model stealing attacks to demonstrate that a suspect model was in fact stolen from theirs. Although a number of ownership verification techniques based on watermarking or fingerprinting have been proposed, most of them fall short either in terms of security guarantees (well-equipped adversaries can evade verification) or computational cost. A fingerprinting technique introduced at ICLR '21, Dataset Inference (DI), has been shown to offer better robustness and efficiency than prior methods. The authors of DI provided a correctness proof for linear (suspect) models. However, in a subspace of the same setting, we prove that DI suffers from high false positives (FPs) -- it can incorrectly identify an independent model trained with non-overlapping data from the same distribution as stolen. We further prove that DI also triggers FPs in realistic, non-linear suspect models. We then confirm empirically that DI in the black-box setting leads to FPs with high confidence. Second, we show that DI also suffers from false negatives (FNs) -- an adversary can fool DI by regularising a stolen model's decision boundaries using adversarial training, thereby leading to an FN. To this end, we demonstrate that black-box DI fails to identify a model adversarially trained from a stolen dataset -- the setting where DI is the hardest to evade. Finally, we discuss the implications of our findings, the viability of fingerprinting-based ownership verification in general, and suggest directions for future work.

URL: https://openreview.net/forum?id=LKz5SqIXPJ

---

Title: On the Predictive Accuracy of Neural Temporal Point Process Models for Continuous-time Event Data

Abstract: The framework of Temporal Point Processes (TPPs) is the default paradigm used to model sequences of events occurring asynchronously in continuous time. Classical TPP models often rely on strong modeling assumptions, which intrinsically limit their capacity to model complex real-world event dynamics. To address this limitation, neural network parametrizations of TPPs, referred to as neural TPPs, have been proposed to allow more flexible and efficient modeling. Although recent research supports the effectiveness of neural TPPs, their analysis is often based on different baselines, datasets, and experimental setups. As a result, it is hard to pinpoint the source of empirical gains, which is a major limitation of research progress. To bridge this gap, we conduct a large-scale experimental study to assess the predictive accuracy of state-of-the-art neural TPP models on multiple real-world and synthetic event sequence datasets in a carefully designed unified setup. We study the influence of each major architectural component (event encoding, history encoder, decoder parametrization) for both time and mark prediction tasks. We also address the rarely discussed topic of probabilistic calibration for neural TPP models. Finally, we draw meaningful conclusions from the analysis of our results, including the importance of the history size and the impact of the architectural components on predictive accuracy, as well as the poorly calibrated mark distributions of neural TPP models.

URL: https://openreview.net/forum?id=3OSISBQPrM

---

Title: How explainable are adversarially-robust CNNs?

Abstract: Three important criteria of existing convolutional neural networks (CNNs) are (1) test-set accuracy; (2) out-of-distribution accuracy; and (3) explainability. While these criteria have been studied independently, their relationship is unknown. For example, do CNNs with better out-of-distribution performance also have better explainability? Furthermore, most prior explainability studies only evaluate methods on 2-3 common vanilla ImageNet-trained CNNs, leaving it unknown how these methods generalize to CNNs of other architectures and training algorithms. Here, we perform the first large-scale evaluation of the relations between the three criteria using nine feature-importance methods and 12 ImageNet-trained CNNs spanning three training algorithms and five CNN architectures. We report several important insights and recommendations for ML practitioners. First, adversarially robust CNNs have a higher explainability score on gradient-based attribution methods (but not CAM-based or perturbation-based methods). Second, AdvProp models, despite being highly accurate, are not superior in explainability. Third, among the nine feature attribution methods tested, GradCAM and RISE are consistently the best methods. Fourth, Insertion and Deletion are biased towards vanilla and robust models, respectively, due to their strong correlation with the confidence score distributions of a CNN. Fifth, we did not find a single CNN to be the best in all three criteria, which suggests that CNNs with better performance do not have better explainability. Sixth, ResNet-50 is, on average, the best architecture among the architectures used in this study, which indicates that architectures with higher test-set accuracy do not necessarily have better explainability scores.

URL: https://openreview.net/forum?id=x4yPXCfUMT

---
