Updates: Studies, PCA, Global Alliance

Bastian Greshake

Jun 11, 2015, 3:45:18 PM

to snpr-dev...@googlegroups.com

Hey there,

so by now I’m nearly back from this month’s conference circus and would like to give you a small update. (It turned out not to be that small, so there’s a tl;dr at the end.)

# PCA

Over the last weekend I was in Zurich for the research data hackdays there. The group I led decided to try doing a PCA to cluster the data sets that are available on openSNP and to see where people come from originally. This is how far we came: http://make.opendata.ch/wiki/project:opensnp#to-do

As you can see we didn’t get too far, but we managed to reformat all the 23andMe data sets, clean them up a bit and run the PCA on the openSNP data. It’s not super informative yet, but right now I’m trying to add the 1000 Genomes data into the mix. Those samples have known ancestry and could serve as reference populations to compare the openSNP data against.

If that works out nicer: do you think we should somehow integrate this into the website? It would offer a sweet visualization, and we could also use it to a) color the data sets by phenotype to see whether some phenotypes cluster according to ancestry and b) link to users. So on individual user pages we could show the PCA neighborhood to find closely related data sets.

# Studies w/ openSNP data

While in Zurich I also talked to Ulrich (with whom we’re planning the anosmia study) and with Effy (with whom we are doing the survey amongst openSNP users). There’s little progress on the former study (largely because the ethics stuff still isn’t sorted out, but I’m still optimistic that we will get there) but there’s more on the latter. Effy and I spent a morning going over the latest draft of the survey and it should be final pretty soon. But we think it might be best to not send out the emails with the survey over the summer, as people are likely on vacation and might miss the email.

There also was this study: http://arxiv.org/abs/1405.1891 which mentions us quite a lot :-)

# Global Alliance: Lighting a Beacon

From Monday until today I was in Leiden at the Plenary Meeting of the “Global Alliance for Genomics & Health”. They have this little API thingy called “Beacons”, which is basically just a proof of concept. The idea was to create the simplest possible genomic API: all you can ask is whether a database has a data set that contains a given allele at a given position.

So you can ask “does openSNP have a data set with an A at chromosome 3, position 565343?” and the answer is YES/NO/NONE (the latter in case things went wrong somewhere). I think it would be fun if openSNP offered this, because it’s easily done and shows our support for the GA. I already implemented it in a new branch: https://github.com/gedankenstuecke/snpr/pull/177 It would be nice to get your comments on it.
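To make the query above concrete, here is a minimal Python sketch of such a lookup. It is not the code from the pull request: the in-memory `index`, the function name `beacon_response` and the choice to answer NONE for malformed queries are all assumptions made for illustration.

```python
from typing import Dict, Set, Tuple

# Hypothetical in-memory index: (chromosome, position) -> alleles seen in any data set.
# A real service would query the database instead of a dict.
AlleleIndex = Dict[Tuple[str, int], Set[str]]

VALID_ALLELES = {"A", "C", "G", "T"}


def beacon_response(index: AlleleIndex, chromosome, position, allele) -> str:
    """Answer "YES", "NO" or "NONE" for a single allele query.

    "NONE" is used here for malformed input (an assumption of this sketch).
    """
    try:
        pos = int(position)
    except (TypeError, ValueError):
        return "NONE"
    allele = str(allele).upper()
    if pos < 0 or allele not in VALID_ALLELES:
        return "NONE"
    observed = index.get((str(chromosome), pos), set())
    return "YES" if allele in observed else "NO"


# Toy index holding only the position from the example question.
index = {("3", 565343): {"A", "G"}}

print(beacon_response(index, "3", 565343, "A"))  # YES
print(beacon_response(index, "3", 565343, "T"))  # NO
print(beacon_response(index, "3", "oops", "A"))  # NONE
```

Keeping the answer to three fixed strings mirrors how little a Beacon is meant to reveal: only presence or absence, never which data set matched.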

# TL;DR

- We did a PCA in Zurich, shall we include this somehow in openSNP?

- all current studies are still work in progress

- I did a reference implementation of the GA Beacon API, shall we put this live?

Cheers,

Bastian

Philipp Bayer

Jun 11, 2015, 5:10:18 PM

to snpr-dev...@googlegroups.com

Very nice work :)

# PCA

As for the PCA: PLINK also offers an older HapMap release in PLINK
format, it's much much easier to merge with your data:
http://pngu.mgh.harvard.edu/~purcell/plink/res.shtml#hapmap

I've also fiddled with 1000 genomes data in the past, IMHO it's best to
use tabix to download only the SNPs you need, not the entire dataset as
that one's gigantic (and also contains indels etc.)

like for example in this ugly piece of code:
https://gist.github.com/philippbayer/ee424b4e76e6d6e7a71c

that will give you a bunch of VCFs, one for each chromosome. You can
then load the vcf with plink-1.9 or use vcftools to convert the files
into ped/map, like

./vcftools --vcf input_data.vcf --plink --chr 1 --out output_in_plink

The overall picture itself would be a good addition to the stats page,
it would be fun if each user-show page would have a "here's where you
are on the map" picture, that's ideally interactive...

I've also had results similar to EIG with fastStructure:
https://github.com/rajanil/fastStructure It's much much easier to use,
but you can't directly get a nice plot like with EIG

The most annoying part about BEACON is that you can't ask for a specific
genome release :) I guess it always assumes the newest one?

Bastian Greshake

Jun 11, 2015, 5:23:38 PM

to snpr-dev...@googlegroups.com

On 11 Jun 2015, at 23:10, Philipp Bayer <phili...@gmail.com> wrote:

> As for the PCA: PLINK also offers an older HapMap release in PLINK
> format, it's much much easier to merge with your data:
> http://pngu.mgh.harvard.edu/~purcell/plink/res.shtml#hapmap

Nice, I think I might just go for those then. Or do you think there’s a large benefit in using 1k Genomes instead?

> like for example in this ugly piece of code:
> https://gist.github.com/philippbayer/ee424b4e76e6d6e7a71c

Sweet, I used the tool provided by 1k Genomes and it sucks because it’s so slow that you could practically do it by hand…

> The overall picture itself would be a good addition to the stats page,
> it would be fun if each user-show page would have a "here's where you
> are on the map" picture, that's ideally interactive…

Yes, for that reason I think it doesn’t matter whether we are using EIG or fastStructure, because we need to do the plotting later using d3 or whatever anyway. Do you have more good ideas for how to run a PCA like this? Because last weekend was the first time I ever actually tried doing popgen like this :D

> The most annoying part about BEACON is that you can't ask for a specific
> genome release :) I guess it always assumes the newest one?

Yes, right now it’s agnostic with respect to the build/version. But the “beacon of beacon” that lists different beacons requires such data for registration, so at least there you can filter.

btw. I would love for one of you to tell me whether the tests I wrote for the beacon thing actually test what I intended, as I’m a total noob at this :D

Cheers,

Philipp Bayer

Jun 11, 2015, 5:39:04 PM

to snpr-dev...@googlegroups.com

Hi,

The HapMap data is much smaller, with far fewer individuals than
the 1000 Genomes data, that's the biggest difference :)

I've only played around with fastStructure and EIGENSTRAT, which both
gave me relatively similar results in terms of which person was close to
whom. fastStructure is easier to use, but I haven't yet found out how to
get nice PCA-like plots from there, only DISTRUCT-like plots
(http://web.stanford.edu/group/rosenberglab/distructExample.html), in
which a user would be a vertical bar in the overall bar plot, but they
are less intuitive.

You can also run the PCA in R or even ruby
(https://github.com/gbuesing/pca ), all you have to do is to transform
your matrix of "letter" alleles into numbers first.
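The "letters into numbers" step could look roughly like the sketch below. It is one common encoding (count of non-major alleles per genotype: 0, 1 or 2), not necessarily what EIGENSTRAT or the linked Ruby gem expect. The function names are made up for illustration, and anything that is not a two-letter A/C/G/T call is treated as missing.

```python
from collections import Counter
from typing import List, Optional


def is_called(genotype: str) -> bool:
    """True for two-letter calls such as "AG"; no-calls like "--" are rejected."""
    return len(genotype) == 2 and all(base in "ACGT" for base in genotype)


def encode_snp(genotypes: List[str]) -> List[Optional[int]]:
    """Encode one SNP across individuals as the number of non-major alleles.

    The major allele is the most frequent base at this SNP (ties are broken
    alphabetically). Missing genotypes are returned as None.
    """
    called = [g for g in genotypes if is_called(g)]
    if not called:
        return [None] * len(genotypes)
    counts = Counter(base for g in called for base in g)
    major = max(sorted(counts), key=lambda base: counts[base])
    return [
        sum(base != major for base in g) if is_called(g) else None
        for g in genotypes
    ]


# One SNP, five individuals; the fourth has a no-call.
print(encode_snp(["AA", "AG", "GG", "--", "AA"]))  # [0, 1, 2, None, 0]
```

Running this per SNP gives one numeric row per marker. How the missing values are handled (dropping the SNP, imputing the mean, ...) is left open here and depends on the PCA tool.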

What you can do to speed up everything is to use only a subset (100?) of
SNPs, the AIMs (ancestry informative markers) as explained here:
http://www.nature.com/ejhg/journal/v22/n10/full/ejhg20141a.html

For them, 40-80 or 118 AIMs are enough to tease everything apart, so you
could discard the large majority of SNPs. You can either calculate these
yourself like in the above paper or use any of the given lists in the
literature (like
http://www.ncbi.nlm.nih.gov/pubmed/12579416?dopt=Abstract&holding=npg )
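Discarding everything but the AIMs is a simple filter. Below is a small Python sketch, assuming 23andMe-style raw data (tab-separated rsid, chromosome, position, genotype, with `#` comment lines). The function name and the rsIDs in the example are placeholders, not a real AIM list.

```python
from typing import Dict, Iterable, Set


def filter_to_markers(lines: Iterable[str], markers: Set[str]) -> Dict[str, str]:
    """Keep only genotypes whose rsid is in `markers`; returns rsid -> genotype."""
    kept: Dict[str, str] = {}
    for line in lines:
        # Skip blank lines and the commented header of the raw-data file.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # malformed row, ignore it
        rsid, _chromosome, _position, genotype = fields[:4]
        if rsid in markers:
            kept[rsid] = genotype
    return kept


# Placeholder data; real input would be the lines of a downloaded genotype file.
raw_lines = [
    "# rsid\tchromosome\tposition\tgenotype\n",
    "rsEXAMPLE1\t1\t1000\tAA\n",
    "rsEXAMPLE2\t1\t2000\tAG\n",
    "rsEXAMPLE3\t2\t3000\t--\n",
]
aims = {"rsEXAMPLE2", "rsEXAMPLE3"}

print(filter_to_markers(raw_lines, aims))
# {'rsEXAMPLE2': 'AG', 'rsEXAMPLE3': '--'}
```

Because the function takes any iterable of lines, an open file handle can be passed directly, so the full file never has to sit in memory.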

No idea about the tests ;)

Bastian Greshake

Jun 12, 2015, 3:41:56 AM

to snpr-dev...@googlegroups.com

Hey there,

> On 11 Jun 2015, at 23:39, Philipp Bayer <phili...@gmail.com> wrote:
>
> Hi,
>
> The HapMap data is much smaller with much less individuals compared to
> the 1000 Genomes data, that's the biggest difference :)

I see, do you think it might still work for our purposes? Though if we go down to something like 200 informative SNPs, we could easily use the 1000 Genomes data again, as we’d only have to preprocess it once. It would also make things easier because we could pick 200 SNPs which are present in the Ancestry, 23andMe & ftDNA data. Are there ready-to-use lists for this?
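If no ready-made list turns up, picking markers that exist on every platform is just a set intersection. A rough Python sketch, with made-up rsIDs and a hypothetical `shared_markers` helper, assuming we already have the set of rsIDs each platform reports plus a candidate list of informative SNPs:

```python
from typing import Dict, List, Set


def shared_markers(
    platform_rsids: Dict[str, Set[str]], candidates: Set[str], limit: int
) -> List[str]:
    """Return up to `limit` candidate rsids that are present on every platform."""
    common = set(candidates)
    for rsids in platform_rsids.values():
        common &= rsids
    # Sorted only to make the selection reproducible, not to rank informativeness.
    return sorted(common)[:limit]


# Placeholder rsIDs for illustration only.
platforms = {
    "23andMe": {"rsA", "rsB", "rsC", "rsD"},
    "Ancestry": {"rsA", "rsB", "rsD"},
    "ftDNA": {"rsB", "rsD", "rsE"},
}
candidate_aims = {"rsA", "rsB", "rsD", "rsE"}

print(shared_markers(platforms, candidate_aims, limit=200))  # ['rsB', 'rsD']
```

In practice the candidates would presumably be ranked by how informative they are before cutting at 200; the alphabetical cut here is only a stand-in.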

> You can also run the PCA in R or even ruby
> (https://github.com/gbuesing/pca ), all you have to do is to transform
> your matrix of "letter" alleles into numbers first.

I think it should be fine to use smartPCA, it was rather fast to use and is at least somewhat standard it seems?

One last thing: I think we haven’t done it so far and also haven’t discussed it (iirc), but do you guys think that we should join the Global Alliance for Genomics & Health? We don’t fit in with the clinical-health stuff right away, but I think the general idea of standardising genotype/genome/phenotype data and access to it is something that’s also really useful outside the clinic for applications like openSNP & citizen science in general. :-)

Cheers,
Bastian
