CI, short for Continuous Integration, is a software development practice in which all developers merge code changes into a central repository multiple times a day. CD stands for Continuous Delivery, which builds on Continuous Integration by automating the entire software release process.
These are the subjects of this "gentle introduction." Our hope is to give you a flavor of what is involved, and to point you to other sources in case you want to learn more. Our business is based largely on what is commonly called "lens design" or (more correctly) optical design. This article is for the general reader who may be curious about this area of applied optics.
As Einstein might have said, "As gentle as possible, but no gentler." There is a lot of new stuff to learn here, and it's different enough to require some rearrangement of your mental furniture. By 'gentle' I mean that the features are presented practically with examples; as we encounter difficulties, I hope to show how Rust solves these problems. It is important to understand the problems before the solutions make sense. To put it in flowery language, we are going for a hike in hilly country and I will point out some interesting rock formations on the way, with only a few geology lectures. There will be some uphill but the view will be inspiring; the community is unusually pleasant and happy to help. There is the Rust Users Forum and an active subreddit which is unusually well-moderated. The FAQ is a good resource if you have specific questions.
Graphs are all around us; real-world objects are often defined in terms of their connections to other things. A set of objects, and the connections between them, are naturally expressed as a graph. Researchers have developed neural networks that operate on graph data (called graph neural networks, or GNNs) for over a decade. Recent developments have increased their capabilities and expressive power. We are starting to see practical applications in areas such as antibacterial discovery, physics simulations, fake news detection, traffic prediction and recommendation systems.
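To make the idea of "objects and the connections between them" concrete, here is a minimal sketch in Python of one common way to store a graph: an adjacency list. The node names, the edge list, and the helper `build_adjacency` are all hypothetical and chosen purely for illustration; they are not taken from any particular GNN library.

```python
from collections import defaultdict

# Hypothetical toy data: people as nodes, friendships as undirected edges.
nodes = ["ana", "ben", "cho", "dev"]
edges = [("ana", "ben"), ("ana", "cho"), ("cho", "dev")]


def build_adjacency(nodes, edges):
    """Return a dict mapping each node to the sorted list of its neighbours."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)  # undirected: record the connection both ways
    # Look up every node so that isolated nodes also appear in the result.
    return {n: sorted(adjacency[n]) for n in nodes}


adjacency = build_adjacency(nodes, edges)

# The degree of a node is simply the number of neighbours it has.
degrees = {n: len(neighbours) for n, neighbours in adjacency.items()}
```

An adjacency list keeps memory proportional to the number of connections, which tends to suit sparse real-world graphs; a dense adjacency matrix is an alternative when most pairs of nodes are connected.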