This is a blog post rewritten from a presentation at NYC Machine Learning last week. It covers Annoy, a library I have built that helps you do (approximate) nearest neighbor queries in high dimensional spaces. I will be splitting it into several parts. This first part covers vector models, how to measure similarity, and why nearest neighbor queries are useful.
Vector models are increasingly popular in various applications. They have been used in natural language processing for a long time with models like LDA and PLSA (and even earlier with TF-IDF in raw space). Recently a new generation of models has appeared: word2vec, RNNs, etc.
(As a side note: much has been written about word2vec's ability to do word analogies in vector space. This is a powerful demonstration of the structure of these vector spaces, but the idea of using vector spaces is old, and similarity is arguably much more useful.)
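The analogy trick mentioned above is just vector arithmetic plus a nearest neighbor lookup. Here is a minimal sketch using made-up toy vectors (real word2vec embeddings have hundreds of dimensions; the `vectors` dict and its values are purely illustrative):

```python
import numpy as np

# Toy word vectors (hypothetical values, only to illustrate the arithmetic).
vectors = {
    "king":  np.array([0.8, 0.9, 0.1]),
    "queen": np.array([0.8, 0.1, 0.9]),
    "man":   np.array([0.2, 0.9, 0.1]),
    "woman": np.array([0.2, 0.1, 0.9]),
    "apple": np.array([0.0, 0.5, 0.5]),
}

def analogy(a, b, c):
    """Return the word whose vector is closest (by cosine) to a - b + c."""
    target = vectors[a] - vectors[b] + vectors[c]

    def cos(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    # Exclude the query words themselves, as is conventional.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cos(vectors[w], target))

print(analogy("king", "man", "woman"))  # → "queen" with these toy vectors
```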
The MNIST dataset features 60,000 images of size 28×28, each a handwritten digit in grayscale. One of the most basic ways we can play around with this data set is to smash each 28×28 array into a 784-dimensional vector. There is absolutely no machine learning involved in doing this, but we will come back and introduce cool stuff like neural networks and word2vec later.
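The flattening step is a one-liner with NumPy. Here `images` is a stand-in for the MNIST training set (in practice you would load it with your favorite MNIST loader; the random array below just has the right shape and dtype):

```python
import numpy as np

# Stand-in for the MNIST training set: a (60000, 28, 28) uint8 array.
images = np.random.randint(0, 256, size=(60000, 28, 28), dtype=np.uint8)

# Smash each 28x28 image into a 784-dimensional vector.
vectors = images.reshape(len(images), 28 * 28).astype(np.float32)

print(vectors.shape)  # (60000, 784)
```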
Dimensionality reduction is an extremely powerful technique because it lets us take almost any object and translate it to a small, convenient vector representation in a space. This space is generally referred to as latent because we don't necessarily have any prior notion of what the axes are. What we care about is that objects that are similar end up being close to each other. What do we mean by similarity? In a lot of cases we can actually discover that from our data.
Using the neural network as an embedding function and using cosine similarity as a metric (this is basically Euclidean distance, but with the vectors normalized first) we get some quite cool nearest neighbors:
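The equivalence between cosine similarity and Euclidean distance on normalized vectors follows from expanding the squared distance: for unit vectors u and v, ‖u − v‖² = 2 − 2·cos(u, v), so ranking neighbors by either metric gives the same order. A small sketch to verify this numerically:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# For unit vectors, ||u - v||^2 = 2 - 2 * cos(u, v).
rng = np.random.default_rng(0)
u, v = rng.normal(size=64), rng.normal(size=64)
u, v = u / np.linalg.norm(u), v / np.linalg.norm(v)

lhs = np.sum((u - v) ** 2)
rhs = 2 - 2 * cosine_similarity(u, v)
print(np.isclose(lhs, rhs))  # True
```

Because of this identity, a library that does Euclidean nearest neighbor queries handles the cosine case simply by normalizing all vectors up front.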