23andMe exomes


Bastian Greshake

May 17, 2012, 8:56:41 AM
to diy...@googlegroups.com
Hi there,
I'm currently working on adding support for 23andMe exomes to openSNP.org. I'd like to start by adding and parsing the VCF files which are provided through 23andMe. Unfortunately I'm not among the exome customers myself, so I'm in dire need of some files to test my parsing scripts.

Have any of you had your exome sequenced through 23andMe, and could you provide me with your VCF file? I won't publish the data, I just need it for unit testing on my development machine.

Cheers,
Bastian

--
// Bastian Greshake
// Zehnthofstraße 36
// 55252 Mainz-Kastel, Germany
// cell: +49 176 213 044 66
// web: www.ruleofthirds.de






Nathan McCorkle

May 17, 2012, 10:02:37 AM
to diy...@googlegroups.com

Ooo I didn't know 23andme started providing exome sequencing. COOL!

--
You received this message because you are subscribed to the Google Groups "DIYbio" group.
To post to this group, send email to diy...@googlegroups.com.
To unsubscribe from this group, send email to diybio+un...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/diybio?hl=en.

Bastian Greshake

May 17, 2012, 10:57:24 AM
to diy...@googlegroups.com
They still don't do it as a regular service as far as I know, but they do offer it for $999: https://www.23andme.com/exome/

Cheers,
Bastian

Nathan McCorkle

May 17, 2012, 11:10:14 AM
to diy...@googlegroups.com

What do you mean it's not a regular service?

Nathan McCorkle

May 17, 2012, 11:12:49 AM
to diy...@googlegroups.com

Ah sorry spoke too soon before reading their page

Madeleine Ball

May 17, 2012, 1:00:25 PM
to diy...@googlegroups.com
Hi,

The Personal Genome Project has a couple participants that have donated their 23andme exome data to be public through our IRB-approved open consent process. You can find the VCF files linked as "source data" on the interpretations here: http://evidence.personalgenomes.org/genomes  You can also find them linked on individual profile pages, e.g.: https://my.personalgenomes.org/profile/hu97DB4A

The VCF format might be frustrating to you in that it fails to distinguish between "not sufficiently covered to make a genotype call" and "matches the reference genome". (It is theoretically possible to report this, but to date it's only been done in a base-by-base manner, which results in ridiculously large files.)

  -- Madeleine

Madeleine Ball

May 17, 2012, 1:04:50 PM
to diy...@googlegroups.com
Hi,

The Personal Genome Project has a couple participants who have donated their 23andme exome data to be public through our IRB-approved open consent process. You can find the VCF files linked as "source data" from their genome reports here: http://evidence.personalgenomes.org/genomes and also on their profile pages (e.g. https://my.personalgenomes.org/profile/hu97DB4A).

You might find the VCF file format frustrating, as it fails to distinguish between "insufficiently covered to make a genotyping call" and "matches the reference genome". (It is theoretically possible to report this, but as far as I'm aware it's only done as a base-by-base report, which makes a ridiculously large file.)


  -- Madeleine


Bastian Greshake

May 17, 2012, 3:44:02 PM
to diy...@googlegroups.com
Hi,

On May 17, 2012, at 19:00 , Madeleine Ball wrote:

> The Personal Genome Project has a couple participants that have donated their 23andme exome data to be public through our IRB-approved open consent process. You can find the VCF files linked as "source data" on the interpretations here: http://evidence.personalgenomes.org/genomes You can also find them linked on individual profile pages, e.g.: https://my.personalgenomes.org/profile/hu97DB4A

This is great. I totally forgot to look at the PGP data for this, my bad. m(

> The VCF format might be frustrating to you in that it fails to distinguish between "not sufficiently covered to make a genotype call" and "matches the reference genome". (It is theoretically possible to report this, but to date it's only been done in a base-by-base manner, which results in ridiculously large files.)

This sounds bad. So the 23andMe-files also report those SNPs which haven't been called as a match to the reference? An example out of one of the VCF-files of the PGP:


1 753405 rs61770173 C A 203.64 […excluded…] GT:AD:DP:GQ:PL 0/1:17,16:34:99:234,0,489


So in this case the genotype is given as 0/1, an unphased C/A, but it also could be that the allele which is given as C wasn't called at all? But there is also information about the read-depth (DP), Genotype-Quality (GQ) and a phred-scaled likelihood (PL), so I could use those to determine how accurate the genotype-calling was, couldn't I?

I should mention: For openSNP we currently only aim to read the known SNPs (those with Rs-IDs).
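
To make the question about DP, GQ and PL concrete, here is a minimal Python sketch of how the per-sample column of the example line could be unpacked. It is only an illustration, not openSNP's actual parser: the function names and the MIN_DEPTH/MIN_GQ cutoffs are made up, the excluded FILTER and INFO columns are stood in for by ".", and it assumes tab-separated columns with exactly one sample.

```python
def parse_sample(format_col, sample_col):
    """Pair the FORMAT keys with the sample's values, e.g. GT -> '0/1'."""
    return dict(zip(format_col.split(":"), sample_col.split(":")))


def genotype_alleles(ref, alt, gt):
    """Translate a GT such as '0/1' into bases; '.' marks a missing call."""
    alleles = [ref] + alt.split(",")          # index 0 is always the reference
    sep = "|" if "|" in gt else "/"           # '|' = phased, '/' = unphased
    return [alleles[int(i)] if i != "." else "." for i in gt.split(sep)]


# Example record from this thread; FILTER and INFO are replaced by "."
# because they were cut from the quoted line (placeholder, not real data).
line = "1\t753405\trs61770173\tC\tA\t203.64\t.\t.\tGT:AD:DP:GQ:PL\t0/1:17,16:34:99:234,0,489"
chrom, pos, rsid, ref, alt, qual, filt, info, fmt, sample = line.split("\t")

fields = parse_sample(fmt, sample)
depth = int(fields["DP"])                          # reads covering the site
gq = float(fields["GQ"])                           # phred-scaled genotype quality
ad = [int(n) for n in fields["AD"].split(",")]     # reads per allele (ref, alt)
calls = genotype_alleles(ref, alt, fields["GT"])

# Hypothetical cutoffs, chosen only for illustration.
MIN_DEPTH, MIN_GQ = 10, 30
confident = depth >= MIN_DEPTH and gq >= MIN_GQ
print(rsid, "/".join(calls), depth, gq, confident)
```

Splitting FORMAT and the sample column in parallel avoids hard-coding the field order, which may differ between providers.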

Madeleine Ball

May 17, 2012, 6:52:06 PM
to diy...@googlegroups.com
> The VCF format might be frustrating to you in that it fails to distinguish between "not sufficiently covered to make a genotype call" and "matches the reference genome". (It is theoretically possible to report this, but to date it's only been done in a base-by-base manner, which results in ridiculously large files.)

> This sounds bad. So the 23andMe-files also report those SNPs which haven't been called as a match to the reference?

They don't report the SNPs which have been called as a homozygous match to reference. (Not sure if this was what you said.)
 
> An example out of one of the VCF-files of the PGP:
>
> 1 753405 rs61770173 C A 203.64 […excluded…] GT:AD:DP:GQ:PL 0/1:17,16:34:99:234,0,489
>
> So in this case the genotype is given as 0/1, an unphased C/A, but it also could be that the allele which is given as C wasn't called at all? But there is also information about the read-depth (DP), Genotype-Quality (GQ) and a phred-scaled likelihood (PL), so I could use those to determine how accurate the genotype-calling was, couldn't I?

Ah, you've cut out a lot of the important data in that line. I know, the "INFO" field is ugly... I think these catch-all fields end up getting used a lot because people didn't really know what they wanted when they invented the format? :-) You can find a description of VCF format here, btw:

The "C" is the variant, the "A" is the reference. This person is either a homozygous or heterozygous carrier of the variant. If the person had "A" homozygously, this line would not be reported.

From the full line:
1 753405 rs61770173 C A 445.79 MQFilter40 AC=2;AF=1.00;AN=2;DB;DP=20;Dels=0.0;FS=0.0;HRun=0;HaplotypeScore=0.0;MQ=28.45;MQ0=1;QD=22.29;SNPEFF_EFFECT=TRANSCRIPT;SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_GENE_BIOTYPE=lincRNA;SNPEFF_GENE_NAME=FAM87B;SNPEFF_IMPACT=MODIFIER;SNPEFF_TRANSCRIPT_ID=ENST00000326734 GT:AD:DP:GQ:PL 1/1:0,17:20:45.08:479,45,0

I interpret this as saying: AC=2 ... allele count is two, AF=1.00 allele frequency of the variant is "100%" ... so this position is homozygous for the variant in this data. What I believe to be a het line from another PGP exome:

1 753405 rs61770173 C A 203.64 MQFilter40 AB=0.515;AC=1;AF=0.50;AN=2;BaseQRankSum=2.916;DB;DP=34;Dels=0.0;FS=0.0;HRun=0;HaplotypeScore=0.0;MQ=38.34;MQ0=3;MQRankSum=-3.837;QD=5.99;ReadPosRankSum=0.197;SNPEFF_EFFECT=TRANSCRIPT;SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_GENE_BIOTYPE=lincRNA;SNPEFF_GENE_NAME=FAM87B;SNPEFF_IMPACT=MODIFIER;SNPEFF_TRANSCRIPT_ID=ENST00000326734 GT:AD:DP:GQ:PL 0/1:17,16:34:99:234,0,489
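
The AC/AF/AN reading above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions: a single-sample file and a biallelic site (with several alternate alleles AC becomes a comma-separated list and this would fail). The function names are hypothetical, and the INFO strings are shortened copies of the two lines above.

```python
def parse_info(info):
    """Split an INFO column into a dict; bare flags such as DB map to True."""
    parsed = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed


def zygosity(info):
    """Classify a single-sample, biallelic record from AC and AN.

    AC = number of alternate alleles called, AN = number of alleles called.
    """
    ac, an = int(info["AC"]), int(info["AN"])
    if ac == 0:
        return "hom-ref"
    return "hom-alt" if ac == an else "het"


# Shortened INFO columns from the two example lines above.
hom = parse_info("AC=2;AF=1.00;AN=2;DB;DP=20;MQ=28.45")
het = parse_info("AB=0.515;AC=1;AF=0.50;AN=2;DB;DP=34;MQ=38.34")
print(zygosity(hom), zygosity(het))
```

For a single sample the same answer can be read from the GT field (1/1 versus 0/1), so the INFO values mostly serve as a cross-check.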

> I should mention: For openSNP we currently only aim to read the known SNPs (those with Rs-IDs).

... Now up to 41 million (dbSNP 134)? It's getting close to having nearly every variant in a genome as "known" by dbSNP. There's only around 3 million nonreference variants in a given genome, so at some point reporting genotypes for "all dbSNP IDs" gets worse than simply reporting the positions that are non-reference.
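
For the rs-ID-only approach, a minimal Python sketch of what a lookup could look like, including the caveat from earlier in the thread: an rs-ID that is absent from the file may be homozygous reference or simply not covered, and the file alone cannot tell which. The function names are hypothetical, the second record in the sample data is invented, and the layout (tab-separated, one sample column) is an assumption.

```python
import io


def index_by_rsid(handle):
    """Map rs-ID -> (ref, alt, GT) for every record that carries an rs-ID."""
    index = {}
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue                      # skip header and blank lines
        cols = line.rstrip("\n").split("\t")
        ids, ref, alt, fmt, sample = cols[2], cols[3], cols[4], cols[8], cols[9]
        gt = dict(zip(fmt.split(":"), sample.split(":"))).get("GT", "./.")
        for rsid in ids.split(";"):       # the ID column may list several IDs
            if rsid.startswith("rs"):
                index[rsid] = (ref, alt, gt)
    return index


def lookup(index, rsid):
    """Return the record, or None when the rs-ID is not in the file.

    None is ambiguous: homozygous reference OR not covered well enough.
    """
    return index.get(rsid)


# First record is from this thread (FILTER/INFO replaced by "."); the second
# record is made up to show a variant without an rs-ID being skipped.
vcf = io.StringIO(
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE\n"
    "1\t753405\trs61770173\tC\tA\t203.64\t.\t.\tGT:AD:DP:GQ:PL\t0/1:17,16:34:99:234,0,489\n"
    "1\t800000\t.\tG\tT\t50.0\t.\t.\tGT\t1/1\n"
)
index = index_by_rsid(vcf)
print(lookup(index, "rs61770173"), lookup(index, "rs123"))
```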

Madeleine Ball

May 17, 2012, 6:53:30 PM
to diy...@googlegroups.com
Oops, I mean the "C" is reference and "A" is variant -- swap all my letters, I got the columns confused, I should've reviewed that link first!

Bastian Greshake

Jun 4, 2012, 3:30:08 AM
to diy...@googlegroups.com
Hey there,
This may be of interest to some of you: we've now added support for exome data to openSNP. For practical reasons we've limited ourselves to results in the Variant Call Format (VCF), which is supplied, for example, by the 23andMe exome service. Basically just like we've discussed in this thread. You can find a few more details on our decision to limit it this way in our blog post: http://opensnp.wordpress.com/2012/05/31/support-for-23andme-exome-data/

If you got your exome done through another provider (or did it yourself ;)) you can also upload your data, as long as the VCF-format is roughly the same. For an example VCF-file have a look at http://opensnp.org/data/42.23andme-exome-vcf.205

Have fun playing around with the exome data and please consider sharing yours :-)

Cheers,
Bastian