Weekly TMLR digest for Sep 04, 2022


TMLR

Sep 3, 2022, 8:00:10 PM
to tmlr-annou...@googlegroups.com

Accepted papers
===============


Title: Estimating Potential Outcome Distributions with Collaborating Causal Networks

Authors: Tianhui Zhou, William E Carson IV, David Carlson

Abstract: Traditional causal inference approaches leverage observational study data to estimate the difference in observed (factual) and unobserved (counterfactual) outcomes for a potential treatment, known as the Conditional Average Treatment Effect (CATE). However, CATE corresponds to the comparison on the first moment alone, and as such may be insufficient to reflect the full picture of treatment effects. As an alternative, estimating the full potential outcome distributions could provide greater insights. However, existing methods for estimating treatment effect potential outcome distributions often impose restrictive or overly-simplistic assumptions about these distributions. Here, we propose Collaborating Causal Networks (CCN), a novel methodology which goes beyond the estimation of CATE alone by learning the full potential outcome distributions. Estimation of outcome distributions via the CCN framework does not require restrictive assumptions of the underlying data generating process (e.g. Gaussian errors). Additionally, our proposed method facilitates estimation of the utility of each possible treatment and permits individual-specific variation through utility functions (e.g. risk tolerance variability). CCN not only extends outcome estimation beyond traditional risk difference, but also enables a more comprehensive decision-making process through definition of flexible comparisons. Under assumptions commonly made in the causal inference literature, we show that CCN learns distributions that asymptotically capture the correct potential outcome distributions. Furthermore, we propose an adjustment approach that is empirically effective in alleviating sample imbalance between treatment groups in observational studies. Finally, we evaluate the performance of CCN in multiple experiments on both synthetic and semi-synthetic data. We demonstrate that CCN learns improved distribution estimates compared to existing Bayesian and deep generative methods as well as improved decisions with respect to a variety of utility functions.
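
As a rough illustration of the utility-based decision step described above (not the paper's implementation; the sampler, treatment names, and utility function below are placeholders):

import numpy as np

def choose_treatment(outcome_samples, utility):
    # outcome_samples: dict mapping treatment id -> 1-D array of samples drawn
    #     from the estimated potential outcome distribution for that treatment
    #     (however those distributions were fit; CCN itself is not reproduced here).
    # utility: vectorized callable mapping outcomes to utilities, e.g. encoding
    #     an individual's risk tolerance.
    expected = {t: float(np.mean(utility(y))) for t, y in outcome_samples.items()}
    return max(expected, key=expected.get), expected

# Hypothetical usage with a risk-averse (concave) utility:
rng = np.random.default_rng(0)
samples = {
    "control": rng.normal(loc=0.0, scale=0.5, size=10_000),
    "treated": rng.normal(loc=0.3, scale=2.0, size=10_000),
}
best, scores = choose_treatment(samples, utility=lambda y: 1.0 - np.exp(-y))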

URL: https://openreview.net/forum?id=q1Fey9feu7

---

Title: Sparse Coding with Multi-layer Decoders using Variance Regularization

Authors: Katrina Evtimova, Yann LeCun

Abstract: Sparse representations of images are useful in many computer vision applications.
Sparse coding with an $l_1$ penalty and a learned linear dictionary requires regularization of the dictionary to prevent a collapse in the $l_1$ norms of the codes. Typically, this regularization entails bounding the Euclidean norms of the dictionary's elements.
In this work, we propose a novel sparse coding protocol which prevents a collapse in the codes without the need to regularize the decoder. Our method regularizes the codes directly so that each latent code component has variance greater than a fixed threshold over a set of sparse representations for a given set of inputs. Furthermore, we explore ways to effectively train sparse coding systems with multi-layer decoders since they can model more complex relationships than linear dictionaries.
In our experiments with MNIST and natural image patches, we show that decoders learned with our approach have interpretable features in both the linear and multi-layer cases. Moreover, we show that sparse autoencoders with multi-layer decoders trained using our variance regularization method produce higher-quality reconstructions with sparser representations when compared to autoencoders with linear dictionaries. Additionally, sparse representations obtained with our variance regularization approach are useful in the downstream tasks of denoising and classification in the low-data regime.
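
As a rough illustration of the code-level constraint described above, the sketch below penalizes any latent component whose standard deviation over a batch of codes falls below a fixed threshold; the exact functional form and hyperparameters used in the paper may differ.

import torch

def variance_hinge_penalty(codes, threshold=1.0, eps=1e-4):
    # codes: (batch, latent_dim) sparse codes for a batch of inputs.
    # Hinge on the per-component standard deviation: components whose spread
    # over the batch drops below `threshold` are penalized, discouraging a
    # collapse of the code norms without regularizing the decoder weights.
    std = torch.sqrt(codes.var(dim=0) + eps)
    return torch.relu(threshold - std).mean()

# Hypothetical use inside a sparse autoencoder training step:
# loss = recon_loss + l1_weight * codes.abs().mean() \
#        + var_weight * variance_hinge_penalty(codes)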

URL: https://openreview.net/forum?id=4GuIi1jJ74

---

Title: Emergent Abilities of Large Language Models

Authors: Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, William Fedus

Abstract: Scaling up language models has been shown to predictably improve performance and sample efficiency on a wide range of downstream tasks. This paper instead discusses an unpredictable phenomenon that we refer to as emergent abilities of large language models. We consider an ability to be emergent if it is not present in smaller models but is present in larger models. Thus, emergent abilities cannot be predicted simply by extrapolating the performance of smaller models. The existence of such emergence raises the question of whether additional scaling could potentially further expand the range of capabilities of language models.

URL: https://openreview.net/forum?id=yzkSU5zdwD

---

Title: Efficient CDF Approximations for Normalizing Flows

Authors: Chandramouli Shama Sastry, Andreas Lehrmann, Marcus A Brubaker, Alexander Radovic

Abstract: Normalizing flows model a complex target distribution in terms of a bijective transform operating on a simple base distribution. As such, they enable tractable computation of a number of important statistical quantities, particularly likelihoods and samples. Despite these appealing properties, the computation of more complex inference tasks, such as the cumulative distribution function (CDF) over a complex region (e.g., a polytope) remains challenging. Traditional CDF approximations using Monte-Carlo techniques are unbiased but have unbounded variance and low sample efficiency. Instead, we build upon the diffeomorphic properties of normalizing flows and leverage the divergence theorem to estimate the CDF over a closed region in target space in terms of the flux across its \emph{boundary}, as induced by the normalizing flow. We describe both deterministic and stochastic instances of this estimator: while the deterministic variant iteratively improves the estimate by strategically subdividing the boundary, the stochastic variant provides unbiased estimates. Our experiments on popular flow architectures and UCI benchmark datasets show a marked improvement in sample efficiency as compared to traditional estimators.
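
For reference, the traditional Monte-Carlo baseline mentioned above amounts to sampling from the flow and counting how many samples land in the region; the paper's boundary-flux estimators are not reproduced here. A minimal sketch, with the flow stubbed by any sampler:

import numpy as np

def mc_cdf_estimate(sample_fn, in_region, n_samples=100_000):
    # sample_fn(n) -> (n, d) array of samples in target space.
    # in_region(x) -> (n,) boolean array testing membership in the region
    #                 (e.g. an axis-aligned box or a polytope).
    # Returns an unbiased Monte-Carlo estimate of P(X in region).
    x = sample_fn(n_samples)
    return float(np.mean(in_region(x)))

# Hypothetical usage with a standard normal "flow" and a box region:
rng = np.random.default_rng(0)
p_box = mc_cdf_estimate(
    sample_fn=lambda n: rng.standard_normal((n, 2)),
    in_region=lambda x: np.all((x > -1.0) & (x < 1.0), axis=1),
)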

URL: https://openreview.net/forum?id=LnjclqBl8R

---

Title: Unsupervised Dense Information Retrieval with Contrastive Learning

Authors: Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, Edouard Grave

Abstract: Recently, information retrieval has seen the emergence of dense retrievers, using neural networks, as an alternative to classical sparse methods based on term-frequency. These models have obtained state-of-the-art results on datasets and tasks where large training sets are available. However, they do not transfer well to new applications with no training data, and are outperformed by unsupervised term-frequency methods such as BM25. In this work, we explore the limits of contrastive learning as a way to train unsupervised dense retrievers and show that it leads to strong performance in various retrieval settings. On the BEIR benchmark our unsupervised model outperforms BM25 on 11 out of 15 datasets for the Recall@100. When used as pre-training before fine-tuning, either on a few thousand in-domain examples or on the large MS~MARCO dataset, our contrastive model leads to improvements on the BEIR benchmark. Finally, we evaluate our approach for multi-lingual retrieval, where training data is even scarcer than for English, and show that our approach leads to strong unsupervised performance. Our model also exhibits strong cross-lingual transfer when fine-tuned on supervised English data only and evaluated on low-resource languages such as Swahili. We show that our unsupervised models can perform cross-lingual retrieval between different scripts, such as retrieving English documents from Arabic queries, which would not be possible with term matching methods.
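
A minimal sketch of the kind of contrastive objective used to train such unsupervised dense retrievers (InfoNCE with in-batch negatives); the paper's positive-pair construction and hyperparameters are not reproduced here, and the temperature below is an assumption.

import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(q_emb, d_emb, temperature=0.05):
    # q_emb, d_emb: (batch, dim) embeddings of queries and their positive
    # passages (for unsupervised training these can be two random spans of
    # the same document). Row i of q_emb is matched to row i of d_emb; all
    # other rows in the batch act as negatives.
    q = F.normalize(q_emb, dim=-1)
    d = F.normalize(d_emb, dim=-1)
    logits = q @ d.t() / temperature                   # (batch, batch) similarities
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)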

URL: https://openreview.net/forum?id=jKN1pXi7b0

---

Title: Equivariant Mesh Attention Networks

Authors: Sourya Basu, Jose Gallego-Posada, Francesco Viganò, James Rowbottom, Taco Cohen

Abstract: Equivariance to symmetries has proven to be a powerful inductive bias in deep learning research. Recent works on mesh processing have concentrated on various kinds of natural symmetries, including translations, rotations, scaling, node permutations, and gauge transformations. To date, no existing architecture is equivariant to all of these transformations. In this paper, we present an attention-based architecture for mesh data that is provably equivariant to all transformations mentioned above. Our pipeline relies on the use of relative tangential features: a simple, effective, equivariance-friendly alternative to raw node positions as inputs. Experiments on the FAUST and TOSCA datasets confirm that our proposed architecture achieves improved performance on these benchmarks and is indeed equivariant, and therefore robust, to a wide variety of local/global transformations.

URL: https://openreview.net/forum?id=3IqqJh2Ycy

---

Title: On the Convergence of Shallow Neural Network Training with Randomly Masked Neurons

Authors: Fangshuo Liao, Anastasios Kyrillidis

Abstract: With the motive of training all the parameters of a neural network, we study why and when one can achieve this by iteratively creating, training, and combining randomly selected subnetworks. Such scenarios have either implicitly or explicitly emerged in the recent literature: see e.g., the Dropout family of regularization techniques, or some distributed ML training protocols that reduce communication/computation complexities, such as the Independent Subnet Training protocol. While these methods are studied empirically and utilized in practice, they often enjoy partial or no theoretical support, especially when applied on neural network-based objectives.

In this manuscript, our focus is on overparameterized single hidden layer neural networks with ReLU activations in the lazy training regime. By carefully analyzing $i)$ the subnetworks' neural tangent kernel, $ii)$ the surrogate functions' gradient, and $iii)$ how we sample and combine the surrogate functions, we prove a linear convergence rate of the training error --up to a neighborhood around the optimal point-- for an overparameterized single-hidden layer perceptron with a regression loss. Our analysis reveals a dependency of the size of the neighborhood around the optimal point on the number of surrogate models and the number of local training steps for each selected subnetwork. Moreover, the considered framework generalizes and provides new insights on dropout training, multi-sample dropout training, as well as Independent Subnet Training; for each case, we provide convergence results as corollaries of our main theorem.
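
A minimal sketch of the create/train/combine loop analyzed above, for a two-layer ReLU network with hypothetical `hidden` and `out` linear layers; the sampling and combination rules here are illustrative (the combination is implicit because all subnetworks share the same weight storage), and the paper's exact protocol may differ.

import torch

def train_with_random_masks(model, data_loader, loss_fn, n_rounds=10,
                            local_steps=5, keep_prob=0.5, lr=1e-2):
    hidden_dim = model.hidden.out_features
    for _ in range(n_rounds):
        # Create: sample a random subnetwork as a mask over hidden neurons.
        mask = (torch.rand(hidden_dim) < keep_prob).float()
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        # Train: a few local steps in which only masked-in neurons fire, so
        # gradients reach only the selected subnetwork's parameters.
        for _, (x, y) in zip(range(local_steps), data_loader):
            h = torch.relu(model.hidden(x)) * mask
            loss = loss_fn(model.out(h), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
        # Combine: updates accumulate in the shared weights across rounds.
    return model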

URL: https://openreview.net/forum?id=e7mYYMSyZH

---


New submissions
===============


Title: Algorithms and Theory for Supervised Gradual Domain Adaptation

Abstract: The phenomenon of data distribution evolving over time has been observed in a range of applications, calling for adaptive learning algorithms. We thus study the problem of supervised gradual domain adaptation, where labeled data from shifting distributions are available to the learner along the trajectory, and we aim to learn a classifier on a target data distribution of interest. Under this setting, we provide the first generalization upper bound on the learning error under mild assumptions. Our results are algorithm agnostic, general for a range of loss functions, and only depend linearly on the averaged learning error across the trajectory. This shows significant improvement compared to the previous upper bound for unsupervised gradual domain adaptation, where the learning error on the target domain depends exponentially on the initial error on the source domain. Compared with the offline setting of learning from multiple domains, our results also suggest the potential benefits of the temporal structure among different domains in adapting to the target one. Empirically, our theoretical results imply that learning proper representations across the domains will effectively mitigate the learning errors. Motivated by these theoretical insights, we propose a min-max learning objective to learn the representation and classifier simultaneously. Experimental results on both semi-synthetic and large-scale real datasets corroborate our findings and demonstrate the effectiveness of our objectives.

URL: https://openreview.net/forum?id=35y5hv9fbb

---

Title: Proactive Pseudo-Intervention: Pre-informed Contrastive Learning For Interpretable Vision Models

Abstract: Deep neural networks excel at comprehending complex visual signals, delivering performance on par with or even superior to that of human experts. However, ad-hoc visual explanations of model decisions often reveal an alarming level of reliance on exploiting non-causal visual cues that strongly correlate with the target label in training data. As such, deep neural nets suffer compromised generalization to novel inputs collected from different sources, and the reverse engineering of their decision rules offers limited interpretability. To overcome these limitations, we present a novel contrastive learning strategy called Proactive Pseudo-Intervention (PPI) that leverages proactive interventions to guard against image features with no causal relevance. We also devise a novel pre-informed salience mapping module to identify key image pixels to intervene, and show it greatly facilitates model interpretability. To demonstrate the utility of our proposal, we benchmark it on both standard natural images and challenging medical image datasets. PPI-enhanced models consistently deliver superior performance relative to competing solutions, especially on out-of-domain predictions and data integration from heterogeneous sources. Further, saliency maps of models trained in our PPI framework are more succinct and meaningful.
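
The abstract does not spell out the training objective; as a heavily hedged sketch, one way to implement a "pseudo-intervention" is to mask the most salient pixels (here with a plain gradient saliency map standing in for the paper's pre-informed salience module) and penalize confident predictions on the intervened image.

import torch
import torch.nn.functional as F

def saliency_mask_intervention(model, x, y, top_frac=0.1):
    # Zero out the top_frac most salient pixels of each image, using a simple
    # gradient-based saliency map as a stand-in for the paper's module.
    x = x.clone().requires_grad_(True)
    grad, = torch.autograd.grad(F.cross_entropy(model(x), y), x)
    sal = grad.abs().sum(dim=1, keepdim=True)                    # (B, 1, H, W)
    k = max(1, int(top_frac * sal[0].numel()))
    thresh = sal.flatten(1).topk(k, dim=1).values[:, -1].view(-1, 1, 1, 1)
    return x.detach() * (sal < thresh).float()                   # keep non-salient pixels

def ppi_style_loss(model, x, y, alpha=1.0):
    # Standard cross-entropy plus a term pushing the model toward an
    # uninformative (uniform) prediction once the putatively causal pixels
    # have been removed. Illustrative only; not the paper's exact loss.
    ce = F.cross_entropy(model(x), y)
    probs = F.softmax(model(saliency_mask_intervention(model, x, y)), dim=-1)
    uniform = torch.full_like(probs, 1.0 / probs.size(-1))
    return ce + alpha * F.kl_div(probs.log(), uniform, reduction="batchmean")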

URL: https://openreview.net/forum?id=0P2UDcaCnY

---

Title: Neuronal Learning Analysis using Cycle-Consistent Adversarial Networks

Abstract: Recent advances in neural imaging technologies enable high-quality recordings from hundreds of neurons over multiple days, with the potential to uncover how activity in neural circuits reshapes over learning. However, the complexity and dimensionality of population responses pose significant challenges for analysis. To cope with this problem, existing methods for studying neuronal adaptation and learning often impose strong assumptions on the data or model that may result in biased descriptions of the activity changes. In this work, we avoid such biases by developing a data-driven analysis method for revealing activity changes due to task learning. We use a variant of cycle-consistent adversarial networks to learn the unknown mapping from pre- to post-learning neuronal responses. To do so, we develop an end-to-end pipeline to preprocess, train, validate and interpret the unsupervised learning framework with calcium imaging data. We validate our method on two synthetic datasets with known ground-truth transformation, as well as on V1 recordings obtained from behaving mice, where the mice transition from novice to expert-level performance in a vision-based behavioral experiment. We show that our models can identify neurons and spatiotemporal activity patterns relevant to learning the behavioral task, in terms of subpopulations maximizing behavioral decoding performance and task characteristics not explicitly used for training the models. Together, our results demonstrate that analyzing neuronal learning processes with data-driven deep unsupervised methods can unravel activity changes in complex datasets.
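
The cycle-consistency term at the core of this approach can be sketched as follows, where G maps pre-learning responses to post-learning responses and F_inv maps back; the adversarial (discriminator) losses of the full CycleGAN objective and the network architectures are omitted here.

import torch.nn.functional as F

def cycle_consistency_loss(G, F_inv, x_pre, y_post):
    # x -> G(x) -> F_inv(G(x)) should reconstruct x, and symmetrically for y.
    forward_cycle = F.l1_loss(F_inv(G(x_pre)), x_pre)
    backward_cycle = F.l1_loss(G(F_inv(y_post)), y_post)
    return forward_cycle + backward_cycle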

URL: https://openreview.net/forum?id=Xpy2g8MINU

---

Title: Joint Representation Training in Sequential Tasks with Shared Structure

Abstract: Classical theory in reinforcement learning (RL) predominantly focuses on the single task setting, where an agent learns to solve a task through trial-and-error experience, given access to data only from that task. However, many recent empirical works have demonstrated the significant practical benefits of leveraging a joint representation trained across multiple, related tasks. In this work we theoretically analyze such a setting, formalizing the concept of \emph{task relatedness} as a shared state-action representation that admits linear dynamics in all the tasks. We introduce the \textsf{Shared-MatrixRL} algorithm for the setting of Multitask \textsf{MatrixRL}~\cite{yang2020reinforcement}. In the presence of $P$ episodic tasks of dimension $d$ sharing a joint $r \ll d$ low-dimensional representation, we show the regret on the $P$ tasks can be improved from $O(PHd\sqrt{NH})$ to $O((Hd\sqrt{rP} + HP\sqrt{rd})\sqrt{NH})$ over $N$ episodes of horizon $H$. These gains coincide with those observed in other linear models in contextual bandits and RL~\cite{yang2020impact,hu2021near}. In contrast with previous work that has studied multi-task RL in other function approximation models, we show that in the presence of a bilinear optimization oracle and finite state-action spaces there exists a computationally efficient algorithm for multitask \textsf{MatrixRL} via a reduction to quadratic programming. We also develop a simple technique to shave off a $\sqrt{H}$ factor from the regret upper bounds of some episodic linear problems.


URL: https://openreview.net/forum?id=UnxKHyOiSx

---

Title: Multi-objective Bayesian Optimization with Heuristic Objectives for Biomedical and Molecular Data Analysis Workflows

Abstract: Many practical applications require optimization of multiple, computationally expensive, and possibly competing objectives that are well-suited for multi-objective Bayesian optimization (MOBO) procedures. However, for many types of biomedical data, measures of data analysis workflow success are often heuristic and therefore it is not known a priori which objectives are useful. Thus, MOBO methods that return the full Pareto front may be suboptimal in these cases. Here we propose a novel MOBO method that adaptively updates the scalarization function using properties of the posterior of a multi-output Gaussian process surrogate function. This approach selects useful objectives based on a flexible set of desirable criteria, allowing the functional form of each objective to guide optimization. We demonstrate the qualitative behaviour of our method on toy data and perform proof-of-concept analyses of single-cell RNA sequencing and highly multiplexed imaging datasets.

URL: https://openreview.net/forum?id=5MGL0GEaDJ

---

Title: Time Series Alignment with Global Invariances

Abstract: Multivariate time series are ubiquitous objects in signal processing. Measuring a distance or similarity between two such objects is of prime interest in a variety of applications, including machine learning, but can be very difficult as soon as the temporal dynamics and the representation of the time series, i.e. the nature of the observed quantities, differ from one another. In this work, we propose a novel distance accounting for both feature space and temporal variabilities by learning a latent global transformation of the feature space together with a temporal alignment, cast as a joint optimization problem. The versatility of our framework allows for several variants depending on the invariance class at stake. Among other contributions, we define a differentiable loss for time series and present two algorithms for the computation of time series barycenters under this new geometry. We illustrate the interest of our approach on both simulated and real-world data and show the robustness of our approach compared to state-of-the-art methods.
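
One way to read "a latent global transformation of the feature space together with a temporal alignment, cast as a joint optimization problem" is as alternating minimization: fix the map and compute a dynamic-time-warping (DTW) alignment, then fix the alignment and solve an orthogonal Procrustes problem. The sketch below assumes equal feature dimensions and an orthogonal invariance class; the paper covers more general settings and solvers.

import numpy as np

def dtw_path(X, Y):
    # Classical O(nm) DTW between X (n, d) and Y (m, d), returning the path.
    n, m = len(X), len(Y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.sum((X[i - 1] - Y[j - 1]) ** 2)
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def align_with_global_invariance(X, Y, n_iters=10):
    # Alternate between a DTW alignment and the orthogonal map P minimizing
    # sum ||X[i] - P @ Y[j]||^2 over aligned pairs (orthogonal Procrustes).
    P = np.eye(X.shape[1])
    path = []
    for _ in range(n_iters):
        path = dtw_path(X, Y @ P.T)
        A = sum(np.outer(X[i], Y[j]) for i, j in path)   # cross-covariance
        U, _, Vt = np.linalg.svd(A)
        P = U @ Vt
    return P, path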


URL: https://openreview.net/forum?id=JXCH5N4Ujy

---

Title: Pruning Transformers with a Finite Admixture of Keys

Abstract: Pairwise dot product-based self-attention is key to the success of transformers which achieve state-of-the-art performance across a variety of applications in language and vision, but are costly to compute. However, it has been shown that most attention scores and keys in transformers are redundant and can be removed without loss of accuracy. In this paper, we develop a novel probabilistic framework for pruning attention scores and keys in transformers. We first formulate an admixture model of attention keys whose input data to be clustered are attention queries. We show that attention scores in self-attention correspond to the posterior distribution of this model when attention keys admit a uniform prior distribution. We then relax this uniform prior constraint and let the model learn these priors from data, resulting in a new Finite Admixture of Keys (FiAK). The learned priors in FiAK are used for pruning away redundant attention scores and keys in the baseline transformers, improving the diversity of attention patterns that the models capture. We corroborate the efficiency of transformers pruned with FiAK on practical tasks including ImageNet object classification, COCO object detection, and WikiText-103 language modeling. Our experiments demonstrate that transformers pruned with FiAK yield similar or better accuracy than the baseline dense transformers while being much more efficient in terms of memory and computational cost.
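
A minimal sketch of what pruning with learned key priors might look like: keep only the keys with the largest prior mass and fold the log-priors into the attention logits, mirroring the abstract's posterior interpretation. FiAK's admixture model, its score-pruning variant, and its actual pruning criterion are not reproduced here.

import torch
import torch.nn.functional as F

def pruned_attention(q, k, v, prior_logits, keep_ratio=0.5):
    # q: (B, Tq, d); k, v: (B, Tk, d); prior_logits: (Tk,) learned parameters.
    n_keep = max(1, int(keep_ratio * k.size(1)))
    keep = torch.topk(prior_logits, n_keep).indices           # surviving keys
    k_kept, v_kept = k[:, keep, :], v[:, keep, :]
    log_prior = F.log_softmax(prior_logits[keep], dim=0)       # learned log-priors
    scores = q @ k_kept.transpose(-2, -1) / k.size(-1) ** 0.5
    attn = F.softmax(scores + log_prior, dim=-1)               # posterior-like weights
    return attn @ v_kept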

URL: https://openreview.net/forum?id=xOkAEuefAK

---

Title: A Generalist Agent

Abstract: Inspired by progress in large-scale language modeling, we apply a similar approach towards building a single generalist agent beyond the realm of text outputs. The agent, which we refer to as Gato, works as a multi-modal, multi-task, multi-embodiment generalist policy. The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens. In this report we describe the model and the data, and document the current capabilities of Gato.

URL: https://openreview.net/forum?id=1ikK0kHjvj

---

Title: A Rigorous Study Of The Deep Taylor Decomposition

Abstract: Saliency methods attempt to explain deep neural networks by highlighting the most salient features of a sample. Some widely used methods are based on a theoretical framework called Deep Taylor Decomposition (DTD), which formalizes the recursive application of the Taylor Theorem to the network's layers. However, recent work has found these methods to be independent of the network's deeper layers and appear to respond only to lower-level image structure. Here, we investigate DTD theory to better understand this perplexing behavior and find that the Deep Taylor Decomposition is equivalent to the basic gradient$\times$input method when the Taylor root points (an important parameter of the algorithm chosen by the user) are locally constant. If the root points are locally input-dependent, then one can justify any explanation. In this case, the theory is under-constrained. In an empirical evaluation, we find that DTD roots do not lie in the same linear regions as the input -- contrary to a fundamental assumption of the Taylor Theorem. The theoretical foundations of DTD were cited as a source of reliability for the explanations. However, our findings urge caution in making such claims.
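
For reference, the gradient$\times$input baseline that DTD is shown to collapse to (under locally constant root points) is only a few lines; the model below is any classifier returning per-class logits.

import torch

def gradient_times_input(model, x, target_class):
    # Elementwise product of the input with the gradient of the target logit
    # with respect to the input.
    x = x.clone().requires_grad_(True)
    logit = model(x)[:, target_class].sum()
    grad, = torch.autograd.grad(logit, x)
    return x.detach() * grad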

URL: https://openreview.net/forum?id=Y4mgmw9OgV

---
