One way to analyze 23andme raw data

Philip Goetz

May 25, 2011, 8:25:22 PM5/25/11
to diy...@googlegroups.com, biocu...@googlegroups.com
I interpreted a relative's genotype from 23andme, using the raw data file:
I looked up each SNP in dbSNP to get the gene name and the SNP type,
then looked up the gene name in OMIM to find its dominance pattern and
associated diseases.
(After doing all this, someone told me about Promethease
(http://www.snpedia.com/index.php/Promethease), which may be better.)

23andme says (https://www.23andme.com/you/faqwin/dataaccuracy/) their technology
has over 99.9% repeatability. That means that if you get genotyped twice,
99.9% of the results will be the same.

That does not mean 99.9% accuracy. Particular DNA sequences may be
systematically miscalled, so the same wrong call is reproduced every time.
This was the case with Affymetrix arrays.
I'm not familiar with the 23andme technology.

This page:
http://www.chromosomechronicles.com/2010/03/27/analysis-of-23andmes-genotyping-high-accuracy-of-illumina-platform-confirmed-by-comparing-siblings/
claims that the SNP calls are 99.15% accurate.
There are about 1 million SNP calls in the 23andme data,
so that accuracy implies roughly 8,500 incorrect SNP calls.

423586 of the 956734 SNPs in the report are listed in dbSNP.
I don't know why over half of the given rs numbers are not listed in
dbSNP; that shouldn't happen.
Only about 70,000 of the SNPs on the chip are non-synonymous
changes in a protein (other changes may be important, but we seldom
know the importance of specific SNPs outside genes).
Only 3370 of the SNPs on the chip can be looked up in dbSNP
and fall in a named gene that has an entry in OMIM.

So we expect to find about (1 - .9915) x 3370 = 28.6 false positives
(false genotypes reported at SNP locations).
Most alleles are homozygous, so about 3/4 of these false positives
will mismatch the disease allele.
Some of those that do match will be only a heterozygous match to a
recessive allele.
2603/3370 have dominance info; 2/3 of these are dominant.
If 1/4 of genes are heterozygous, screening out recessives will remove about
2603/3370 x 1/3 x 1/4 x 28.6 = 1.84 of the false positives.
So we expect 28.6 x 1/4 - 1.84 = 5.3 false-positive disease indications.
(Note that dominance should be applied per mutation rather than per gene,
but I don't have that info.)
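
For reference, here is that arithmetic as a short script (the 3/4, 1/4,
and 2/3 splits above are rough estimates, not measured values):

#!/usr/bin/perl -w
# Sketch of the false-positive estimate above.
use strict;

my $error_rate   = 1 - 0.9915;   # per-call error rate from the sibling comparison
my $omim_snps    = 3370;         # chip SNPs in a named gene with an OMIM entry
my $with_dom     = 2603 / 3370;  # fraction of those with dominance info
my $recessive    = 1 / 3;        # fraction of dominance-annotated genes that are recessive
my $heterozygous = 1 / 4;        # assumed fraction of heterozygous calls

my $false_calls   = $error_rate * $omim_snps;   # ~28.6 wrong genotypes
my $match_disease = $false_calls / 4;           # ~1/4 happen to match the disease allele
my $screened_out  = $false_calls * $with_dom * $recessive * $heterozygous;
printf "wrong calls %.1f, matching disease allele %.1f, screened out %.2f, expected false positives %.1f\n",
    $false_calls, $match_disease, $screened_out, $match_disease - $screened_out;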

I found zero implicated disease mutations in my data.
(This was not due to bugs in the program filtering out all results.)

It would be cheap, and worth doing, to sequence a dozen genes with
indicated SNPs and check the accuracy of the SNP calls in them.
Neither 23andme nor Illumina provides any data
on the accuracy of the SNP chip! Only on reproducibility.
(23andme doesn't even tell you exactly which Illumina chip they use.)


This is the process I used:

0. Download your file from 23andme.

1. Download dbSNP from NCBI at ftp://ftp.ncbi.nih.gov/snp/.
You only need database/shared_data/SnpFunctionCode.bcp.gz and
database/organisms/human_9606/database/organism_data/b132_SNPContigLocusId_37_1.bcp.gz.

2. Read the explanation of SNPContigLocusId at
http://www.ncbi.nlm.nih.gov/projects/SNP/snp_db_table_description.cgi?t=SNPContigLocusId

3. Build a database from b132_SNPContigLocusId_37_1.bcp.
Save snp_id, locus_symbol (gene name when in a gene), protein_acc,
fxn_class, allele.
Index by snp_id.

4. Download OMIM from ftp://ftp.ncbi.nih.gov/repository/OMIM/ARCHIVE/ .

5. Build a DB from OMIM/ARCHIVE/genemap and omim.txt.
Read dominance info from omim.txt.
Read other info from genemap.
Read genemap.key for the field descriptions.
Save locus, gene, and disorder.
(Join the 3 disorder fields together, throwing out fields that are
just a space.)
Parse the gene field into individual gene symbols.

6. Parse your 23andme file.
For each rs# in the file, strip off the 'rs' and look up the number in
your SNPContigLocusId db.
Retrieve the locus_symbol, allele, and fxn_class.
If fxn_class is in the range 41-44 (mutations that seriously damage the
protein: stop-gain, missense, stop-loss, frameshift),
look up the locus_symbol in omim.
If you find it, retrieve the associated dominance and disorder.
Check the genotype from your 23andme data against
the allele of the SNP and the dominance info (if any).
If it is possible for the individual to have a disease caused by this
mutation, print out the SNP info, including disease;
and add the associated diseases to your list of possible diseases.
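
A minimal sketch of just the file-reading part of step 6 (the filename is
a placeholder; the full program, with the database lookups, appears further
down in this message):

#!/usr/bin/perl -w
# Sketch: walk the 23andme raw file. Header lines start with '#';
# data lines are tab-separated: rsid (or an i# internal id),
# chromosome, position, genotype.
use strict;

my $file = 'genome_Your_Name_Full_Date.txt';   # whatever 23andme named your download
open(my $SNPS, "<", $file) or die "Could not open $file: $!";
while (my $line = <$SNPS>) {
    chomp($line);
    next if $line =~ m/^#/ || $line eq '';     # skip comments and blank lines
    my ($snpid, $chromo, $pos, $genotype) = split(/\t/, $line);
    next unless $snpid =~ m/^rs(\d+)$/;        # skip 23andme-internal i# ids
    my $rsnum = $1;                            # bare number for the dbSNP lookup
    # ... look up $rsnum in the SNPContigLocusId DB; if fxn_class is 41-44,
    # ... look up the gene symbol in the OMIM DB and check the genotype.
}
close($SNPS);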

Here is Perl code to build the databases.
You'll need to install all referenced Perl modules, and also DBD::SQLite,
which DBI loads at run time even though the code never references it directly.

#!/usr/bin/perl -w

# Construct an SQLite DB from the NCBI dbSNP SNPContigLocusId table
# See http://www.ncbi.nlm.nih.gov/projects/SNP/snp_db_table_description.cgi?t=SNPContigLocusId
# Construct an SQLite DB from OMIM
# Copyright 2011 by Phil Goetz

use strict;
use Carp qw(cluck);
# This awesome line makes warnings give a stack trace:
local $SIG{__WARN__} = \&Carp::cluck;
use DBI;
use Getopt::Long qw(:config no_ignore_case no_auto_abbrev);
use lib 'lib';
use dbiSqlite; # library in 'lib/dbiSqlite.pm' with more DB functions

## Declare and define all constants

my $CliFile = "b132_SNPContigLocusId_37_1.bcp";
my $CliDir = "/data/gene/SNP/dbSNP/organisms/human_9606/database/organism_data";
my $Db; # Target database
my $Dbdir = 'data'; # directory to create databases in
my $DbSnp = 1;
my $Omim = 1;
my $OmimDir = '/data/gene/OMIM/ARCHIVE';
my $Test = 0;

## Read command-line arguments
GetOptions(
'dbdir:s' => \$Dbdir,
'dbsnp!' => \$DbSnp,
'cli:s' => \$CliFile,
'clidir:s' => \$CliDir,
'omim!' => \$Omim,
'omimdir:s' => \$OmimDir,
'test!' => \$Test,
);

mkdir($Dbdir) if !-d $Dbdir;
&makeDb($Dbdir, 'snp', \&dbSnp) if $DbSnp;
&makeDb($Dbdir, 'omim', \&omim) if $Omim;

exit(0);


########################


sub makeDb {
my ($dbdir, $db, $subRef) = @_;

my $dbfile = "$dbdir/$db";
print "Making $dbfile\n";
if (-e $dbfile) {
unlink($dbfile);
}

# Open connection to new db
my $dbh = &connectSqlite($dbdir, $db)
or die "Could not create SQLite DB $dbfile";
$subRef->($dbh);

print "Committing $dbfile\n";
$dbh->commit() or die "Could not commit inserts into $db";
$dbh->disconnect;
}


# $snpH: Database handle
sub dbSnp {
my ($snpH) = @_;

# Create table
my $cmd = "CREATE TABLE locusid (" .
"snp_id INTEGER, " .
"contig_acc VARCHAR(15), " .
#"contig_ver INTEGER, " .
"asn_from INTEGER, " .
"asn_to INTEGER, " .
"locus_id INTEGER, " .
"locus_symbol VARCHAR(50), " .
"mrna_acc VARCHAR(17), " .
"protein_acc VARCHAR(17), " .
"fxn_class INTEGER, " .
"reading_frame INTEGER, " .
"allele VARCHAR(25), " . # nucleotide present in SNP
"residue VARCHAR(5), " .
"aa_position INTEGER, " .
#"build_id VARCHAR(10), " .
"ctg_id INTEGER, " .
"mrna_pos INTEGER, " .
"codon VARCHAR(3) " .
")";
&makeTable($snpH, $cmd);

my $InsertCLI = "INSERT INTO locusid ( " .
"'snp_id', 'contig_acc', 'asn_from', 'asn_to', " .
"'locus_id', 'locus_symbol', 'mrna_acc', 'protein_acc', " .
"'fxn_class', 'reading_frame', 'allele', 'residue', " .
"'aa_position', 'ctg_id', 'mrna_pos', 'codon' " .
") VALUES ( ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ? )";

&makeSnp($snpH, "$CliDir/$CliFile", $InsertCLI);

&makeIndex($snpH, 'locusid', 'snp_id');
&makeIndex($snpH, 'locusid', 'locus_symbol');
&makeIndex($snpH, 'locusid', 'protein_acc');
&makeIndex($snpH, 'locusid', 'asn_from');
}


sub makeSnp {
my ($snpH, $file, $insert) = @_;

print "Reading $file\n";
open(my $IN, "<$file") or die "Could not open $file";
my $count = 0;
while ((my $line = <$IN>) && (!$Test || $count++ < 1000)) {
chomp($line);
my ($snp_id, $contig_acc, $contig_ver, $asn_from, $asn_to,
$locus_id, $locus_symbol, $mrna_acc, $mrna_ver,
$protein_acc, $protein_ver, $fxn_class, $frame, $allele,
$residue, $aapos, $build_id, $ctg_id, $mrna_pos0, $mrna_pos1, $codon)
= split(/\t/, $line);
$mrna_acc = "$mrna_acc.$mrna_ver" if $mrna_acc;
$protein_acc = "$protein_acc.$protein_ver" if $protein_acc;
&sql($snpH, $insert,
$snp_id, $contig_acc, $asn_from, $asn_to,
$locus_id, $locus_symbol, $mrna_acc,
$protein_acc, $fxn_class, $frame, $allele,
$residue, $aapos, $ctg_id, $mrna_pos0, $codon);
}
close($IN);
print "Done reading $file\n";
}


# Make OMIM DB
# $omimH: Database handle
sub omim {
my ($omimH) = @_;

my $cmd = "CREATE TABLE genemap (" .
"id INTEGER, " . # how we link our tables
"date INTEGER, " .
"loc VARCHAR(15), " .
"status VARCHAR(1), " .
"mim INTEGER, " .
"dominance VARCHAR(1), " . # 'd', 'r', or undef/NULL
"disorder VARCHAR, " .
# mim is NOT unique
#"CONSTRAINT id_uniq UNIQUE (mim))";
"CONSTRAINT id_uniq UNIQUE (id) )";
&makeTable($omimH, $cmd);

$cmd = "CREATE TABLE gene (" .
"id INTEGER, " .
"mim INTEGER, " .
"gene VARCHAR)";
&makeTable($omimH, $cmd);

&makeOmim($omimH, 'genemap');

&makeIndex($omimH, 'genemap', 'id');
&makeIndex($omimH, 'gene', 'id');
&makeIndex($omimH, 'gene', 'gene');
}


sub makeOmim {
my ($dbh, $table) = @_;

# First, read the omim.txt file to try to pick up dominant/recessive info
# Sadly, this is not a very machine-readable file
my $file = "$OmimDir/omim.txt";
my $count = 0;
my $mim;
my %dominance; # $dominance{$mim} = 'd' for dominant; 'r' for recessive
print "Reading $file\n";
open(my $OMIM, "<$file") or die "Could not open $file";
while ((my $line = <$OMIM>) && (!$Test || $count++ < 10000)) {
chop($line) while ($line =~ m/[\n\r]$/);
if ($line eq '*FIELD* NO') {
$line = <$OMIM>; # MIM number is on next line
chop($line) while ($line =~ m/[\n\r]$/);
$mim = $line;
}
elsif ($line eq 'INHERITANCE:') {
$line = <$OMIM>; # inheritance term is on next line
chop($line) while ($line =~ m/[\n\r]$/);
if ($line =~ m/dominant/i) {
$dominance{$mim} = 'd';
}
elsif ($line =~ m/recessive/i) {
$dominance{$mim} = 'r';
}
elsif ($line =~ m/isolated cases/i) {
$dominance{$mim} = 'i';
}
elsif ($line =~ m/multifactorial/i) {
$dominance{$mim} = 'm';
}
elsif ($line =~ m/somatic/i) {
$dominance{$mim} = 's';
}
elsif ($line =~ m/X-linked/i) {
$dominance{$mim} = 'x';
}
elsif ($line =~ m/Y-linked/i) {
$dominance{$mim} = 'y';
}
elsif ($line =~ m/mitochondrial/i) {
$dominance{$mim} = 'M';
}
else {
print "Unknown inheritance: $line\n";
}
}
}
close($OMIM);

# Now read genemap
$file = "$OmimDir/$table";
my $insert = "INSERT INTO $table ('id', 'date', 'loc', 'status',
'mim', 'dominance', 'disorder')" .
" VALUES ( ?, ?, ?, ?, ?, ?, ? )";
my $insertGene = "INSERT INTO gene ( 'id', 'mim', 'gene' ) VALUES (?, ?, ?)";

print "Reading $file\n";
open(my $IN, "<$file") or die "Could not open $file";
my $id = 1;
while (my $line = <$IN>) {
chop($line) while ($line =~ m/[\n\r]$/);
# Disorder is split arbitrarily into up to 3 fields
my ($num, $month, $day, $year, $loc, $gene, $status, $title, $foo,
$mim, $method, $comments, $thirteen, $diso14, $diso15, $diso16)
= split(/\|/, $line);
my @disos;
foreach my $diso ($diso14, $diso15, $diso16) {
# Skip empty fields, some of which have a space
push(@disos, $diso) if ($diso && length($diso) > 1);
}
my $disorder = join('', @disos);
my $date = "$year$month$day";
if ($loc eq 'Chr.X') {
if ($dominance{$mim} && $dominance{$mim} ne 'x') {
print "NOTE: $mim is on $loc yet says dominance=$dominance{$mim}\n";
}
else {
$dominance{$mim} = 'x'; # not sure how to interpret this
print "NOTE: Calling $mim dominance='x'\n";
}
}
# Insert most of data into main table
&sql($dbh, $insert, $id, $date, $loc, $status, $mim,
$dominance{$mim}, $disorder);

# Insert one row per gene, so we can look up by gene
# $gene from omim text file may be "GJA9, CX59"
my @genes = split(/,\s*/, $gene);
foreach my $g (@genes) {
# Should possibly insert lc($g) (lowercase)
&sql($dbh, $insertGene, $id, $mim, $g);
}
$id++;
}
close($IN);
}


----------------------------------------------
Here is the Perl file to parse the data, parseSnps ('parsnips').
Remove all lines that say '&track', '&trace', '&trackBegin', or '&trackEnd'.
It might not run right away; I am testing a slightly different version
that uses my libraries.
Usage: perl parseSnps -dbdir <directory for SQLite dbs> [-o outfile]
-snpdir <directory with genome file>
[-syn|-nosyn] [-omim|-noomim] [-ud|-noud] [-us|-nous]
where
-syn: Print info on synonymous substitutions
-omim: Skip SNPs not matched to genes in OMIM
-ud: Print info on heterozygous SNPs in genes of unknown dominance
-us: Print info on SNPs of fxn_class other than 41-44

Note that dominance information is not reliable; it is recorded on a
per-gene basis, whereas it can differ between SNPs in the same gene.
Possibly dominance should not be checked at all for that reason.


#!/usr/bin/perl -w

# Read a 23andme SNP file.
# Use the sqlite DB made from NCBI SNPContigLocusId
# See http://www.ncbi.nlm.nih.gov/projects/SNP/snp_db_table_description.cgi?t=SNPContigLocusId
# Construct:
# a new version of the reference genome with those SNPs substituted into it
# a new file that is the original SNP file, plus gene ID, gene name,
# a field indicating whether the SNP is synonymous or not,
# and a field indicating any entry in OMIM for that SNP
# Copyright 2011 by Phil Goetz

use strict;
use Carp qw(cluck);
# This awesome line makes warnings give a stack trace:
local $SIG{__WARN__} = \&Carp::cluck;
use DBI;
use English;
use File::Basename;
my $Bin;
BEGIN {
$Bin = &dirname($0);
}
use lib "$Bin/lib";
use Getopt::Long qw(:config no_ignore_case no_auto_abbrev);
use dbiSqlite;

select STDOUT; $OUTPUT_AUTOFLUSH = 1;

## Declare and define all constants

my $Db; # Target database
my $Dbdir = 'data';
my $DEBUG = 0; # 1 => print extra debugging output
my $Omim = 1; # 1 => output only genes with disease from OMIM
my $OutFile = 'genome.txt';
my $SnpFile = 'genome_Your_Name_Full_Date.txt';
my $SnpDir ='directory/containing/SnpFile';
my $Syn = 0; # 1 => output synonymous substitutions
my $Test = 0;
my $UnknownDom = 1; # 1 => output SNPs with unknown dominance
my $UnknownSnp = 0; # 0 => output only SNPs with definitely-bad fxn_class

## Read command-line arguments
GetOptions(
'dbdir:s' => \$Dbdir,
'debug|d!' => \$DEBUG,
'omim!' => \$Omim,
'out|o:s' => \$OutFile,
'snpdir:s' => \$SnpDir,
'snps:s' => \$SnpFile,
'syn!' => \$Syn,
'test!' => \$Test,
'ud!' => \$UnknownDom,
'us!' => \$UnknownSnp,
);

my %Diseases; # hash of all implicated diseases

# Open connection to new db
my $snpH = &connectSqlite($Dbdir, 'snp') or die "Could not open DB snp";
my $omimH = &connectSqlite($Dbdir, 'omim') or die "Could not open DB omim";

# NOTE: 'locus_symbol' = gene symbol for genes
my $qSnpId = "SELECT contig_acc, asn_from, locus_id, locus_symbol,
protein_acc, allele, fxn_class FROM locusid WHERE snp_id=?";
my $qOmim = "SELECT gm.disorder, gm.dominance FROM gene g, genemap gm
WHERE g.gene=? AND g.id=gm.id";

my $snpFile = "$SnpDir/$SnpFile";
open(my $SNPS, "<$snpFile") or die "Could not open $snpFile";
open(my $OUT, ">$OutFile") or die "Could not write to $OutFile";
print $OUT "# Output from parseSnps, omim=$Omim syn=$Syn ud=$UnknownDom us=$UnknownSnp\n";
my ($count, $found, $unfound, $ocount, $dcount) = (0, 0, 0, 0, 0);
while ((my $line = <$SNPS>) && (!$Test || $count++ < 1000)) {
chop($line) while ($line =~ m/[\n\r]$/);
my ($allele, $asn_from, $contig_acc, $fxn_class, $locus_id, $gene,
$protein_acc, $disease, $dom);
if ($line =~ m/^[rsi]+(\d+)\t/) {
my ($snpid, $chromo, $posnt, $genotype) = split(/\t+/, $line);
# Some snpid are eg i21234, which is a 23andme-unique identifier
# Someday this program should look up genes by chromosome and position
my $rsnum;
my $expressed = 0; # -1 => no, 0 => unknown, 1 => yes
if ($snpid =~ m/^rs(\d+)$/) {
$rsnum = $1;
# Look up rs# in dbSNP
($contig_acc, $asn_from, $locus_id, $gene, $protein_acc, $allele,
$fxn_class) =
&colSql($snpH, $qSnpId, $rsnum);
if ($gene) {
$found++;
$protein_acc = '' if $protein_acc eq '.'; # fix bug in DB
if ($genotype =~ m/$allele/) {
# Genotype has at least 1 copy of SNP allele
# Look up diseases and dominance in OMIM
# No 'my' here: $disease is declared above and tested again after this block
($disease, $dom) = &colSql($omimH, $qOmim, $gene);
if ($disease) {
$ocount++; # count of SNPs matched to a disease in Omim
$disease =~ s/\t//; # fix bug in DB
#&track("disease for $gene = $disease");
if (length($allele) == 1) {
# Guess whether the disease phenotype is expressed
if ($genotype eq "$allele$allele") {
$expressed = 1; # homozygous
}
# Most entries in genemap lack an entry in omim.txt,
# therefore lack dominance info
elsif ($dom && $dom eq 'r') {
$expressed = -1; # recessive heterozygous
}
# Expressed if dominant (just guessing about x)
$expressed = 1 if $dom && ($dom eq 'd' or $dom eq 'x');
$dcount++ if ($expressed != 0); # dominance count
}
else {
&trace("NOTE: rs$snpid in $gene: Cannot determine expression of
allele $allele");
}
}
}
}
else {
#print "NOTE: Did not find $snpid\n";
$unfound++;
}
}
# Print only if SNP is non-synonymous, or if !$Syn
# fxn_class: see database/shared_data/SnpFunctionCode.bcp for complete list
# 3 => synonymous SNP
# 6 => intron
# 8 => cds-reference: ?
# 9 => unknown
# 41 => STOP-GAIN
# 42 => missense
# 43 => STOP-LOSS
# 44 => frameshift
if ($Syn || !defined($fxn_class) || $fxn_class != 3) {
# Print only if linked to a disease in OMIM, or if !$Omim
if (!$Omim || $disease) {
&track("$snpid\t$gene\t$genotype\t$allele\t$expressed\t$fxn_class")
if $allele;
# If $Omim and !$UnknownSnp, print only if fxn_class is serious
if ($UnknownSnp ||
# If $UnknownSnp, skip the fxn_class and dominance tests
(defined($fxn_class) && $fxn_class > 40 && $fxn_class < 45 &&
($expressed > 0 || ($UnknownDom && $expressed > -1)))) {
# Indicate this person may have this disease
my @diseases = split(/;\s+/, $disease);
foreach my $dis (@diseases) {
$dis =~ s/[{}]//g;
$dis =~ s/^\s+//g;
$dis =~ s/\s+$//g;
$Diseases{$dis}++;
}
no warnings 'uninitialized';
print $OUT
"$snpid\t$chromo\t$posnt\t$genotype\t$allele\t$locus_id\t$gene\t$protein_acc\t$fxn_class\t$disease\n";
use warnings 'uninitialized';
}
}
}
}
else {
print $OUT "$line\n";
}
}
print $OUT "Found $found SNPs in dbSNP; $ocount matched genes in OMIM; $dcount had dominance info.\n";
print $OUT "Did not find $unfound SNPs.\n";

# Print implicated diseases
while (my ($disease, $count) = each (%Diseases)) {
print $OUT "disease\t$disease\t$count\n";
}

close($SNPS);
close($OUT);

$snpH->disconnect;
$omimH->disconnect;

exit(0);


----------------------------------------------
Here is the library module they both use.
# lib/dbiSqlite.pm
# Methods for using SQLite via DBI
# Copyright 2011 by Phil Goetz

package dbiSqlite;

use strict;
use warnings;
use DBI;
use DBI::Const::GetInfoType; # eg $dbh->get_info($GetInfoType{SQL_DBMS_NAME})
use Exporter 'import';


our @EXPORT = qw( &connectSqlite &makeTable &makeIndex &sql &colSql );


# NOTE: SQLite will create the DB if it doesn't exist
sub connectSqlite {
my($dbdirFull, $db, $cache) = @_;
my $dbh;

my $dbargs = {AutoCommit => 0, PrintError => 1};
$dbh = DBI->connect("DBI:SQLite:dbname=$dbdirFull/$db", "", "", $dbargs);
die "FATAL ERR: Did not connect to DB=$db: $DBI::errstr" if !$dbh;
if ($cache) {
# Convert bytes to pages. Each page is about 1500 bytes.
$cache = $cache / 1500;
}
else {
$cache = 500000; # 750M for SQLite
}
my $cmd = "PRAGMA cache_size = $cache";
$dbh->do($cmd) or die "FATAL ERR: Can't do $cmd: $DBI::errstr";

# Tell SQL to hold temporary tables in memory
$cmd = 'PRAGMA temp_store = MEMORY';
$dbh->do($cmd) or die "FATAL ERR: Can't do $cmd: $DBI::errstr";

# Set a non-zero busy timeout, so we can catch 'db locked' with timeout
# NOTE: SQLite is notorious for locking the DB for many seconds to write a single row
$dbh->func(10000, 'busy_timeout'); # 10 seconds

die "FATAL ERR: Did not connect to DB=$db: $DBI::errstr" if !$dbh;

return $dbh;
}

sub makeTable($$) {
my ($dbh, $cmd) = @_;
print "$cmd\n";
$dbh->do($cmd) or die "Could not do($cmd)";
$dbh->commit() or die "Could not commit $cmd into DB";
}

sub makeIndex($$$) {
my ($dbh, $table, $field) = @_;
my $cmd = "CREATE INDEX ${table}_${field}_index ON $table($field)";
if ( !$dbh->do($cmd) ) {
if ($DBI::errstr !~ m/index ${table}_${field}_index already exists/) {
die "Could not $cmd: $DBI::errstr";
}
else { print "NOTE: $DBI::errstr\n"; }
}
$dbh->commit() or die "Could not commit $cmd into DB";
}


# Wrap prepare, execute, and fetchrow_array
sub sql {
my( $dbh, $query, @args ) = @_;
my @results = ();

my $statementHandle = $dbh->prepare($query);
@results = &_executeSql($statementHandle, $query, @args);

return @results;
}

# Do a query for which we want only a single column in the results
# Return a list of the values from that column
# If given a multi-column query, will append all values of all columns into @result
sub colSql {
my( $dbh, $query, @args ) = @_;
my @rows = &sql($dbh, $query, @args);
my @result;
foreach my $row (@rows) {
push(@result, @$row);
}
return @result;
}

# Returns a list of list references
sub _executeSql {
my( $statementHandle, $query, @args ) = @_;

$statementHandle->execute(@args)
or die "Failed query=($query) args=(@args)\n" . $statementHandle->errstr;

my @results = ();
my @row;
if ($query =~ m/^SELECT/i) {
while (@row = $statementHandle->fetchrow_array) {
my @local = @row;
push(@results, \@local);
}
}
#release the statement handle resources
$statementHandle->finish;

return(@results); # Returns an empty list if $query wasn't a SELECT query
}

1; # a module loaded with 'use' must end with a true value

Gavin Scott

May 25, 2011, 8:55:51 PM5/25/11
to diy...@googlegroups.com, biocu...@googlegroups.com
On Wednesday, May 25, 2011 5:25:22 PM UTC-7, Phil wrote:

423586 of the 956734 SNPs in the report are listed in dbSNP.
I don't know why over half of the given rs numbers are not listed in
dbSNP; that shouldn't happen.

Is this a database/chip version issue? That looks similar to the number of SNPs measured by the previous generation of chips used by 23andme.

G.

Phil

May 25, 2011, 8:56:05 PM5/25/11
to DIYbio
On May 25, 8:25 pm, Philip Goetz <philgo...@gmail.com> wrote:

> So we expect to find about (1-.9915) x 3370 = 28.6 false positives
> (false genotypes reported at SNP locations).
> Most alleles are homozygous;
> so about 3/4 of these false positives will mismatch the disease allele.
> Some that match, will have a heterozygous match to a recessive allele.
> 2603/3370 have dominance info; 2/3 of these are dominant.
> If 1/4 of genes are heterozygous, screening out recessives will remove about
> 2603/3370 x 1/3 x 1/4 x 28.6 = 1.84 of the false positives.
> So we expect 28.6 x 1/4 - 1.84 = 5.3 false positive disease indications.
> (Note dominance should be applied per mutation rather than per gene,
> but I don't have that info.)

I forgot: only 414,237 of the 21,902,765 SNPs in that file have fxn_class
in the range 41-44.
So we expect 5.3 x 414,237/21,902,765 = 0.1
false positive disease indications, if you use -nous (the default),
and if this ratio holds for the SNPs in genes related to diseases
(which it probably doesn't; they probably have an
unusually high number of known detrimental mutations;
that's why we know about them).
21,686,073 SNPs have fxn_class > 3, so if you use -us,
you still expect 5.3 x 21,686,073/21,902,765 ~ 5.2 false positives.

The program to build the databases ran out of RAM when trying to
commit changes when I ran it under Windows XP, which can use only 2G
of RAM (3G for some programs if you add the /3GB switch to C:\boot.ini,
but probably not for Perl).
It would probably work if you put the $dbh->commit() statement inside
the outer loop with a counter and a test,
so that it did a commit once every 100,000 entries.
It ran fine under Linux with 6G RAM.
On my system, the final snp SQLite DB is 2932052992 bytes (2.8G).
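
Roughly, that change to makeSnp's loop would look like this (a sketch
only; $IN, $snpH, and $insert are the variables already used in makeSnp):

# Sketch: commit in batches of 100,000 rows so the pending transaction
# never grows large enough to exhaust memory.
my $inserted = 0;
while (my $line = <$IN>) {
    chomp($line);
    # ... split $line and call &sql($snpH, $insert, ...) exactly as makeSnp does ...
    $inserted++;
    if ($inserted % 100_000 == 0) {
        $snpH->commit() or die "Could not commit batch at row $inserted";
    }
}
$snpH->commit() or die "Could not commit the final partial batch";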

The Google groups page hides the final closing parenthesis behind a
"Show quoted text".

- Phil

fxn_class codes:

$ zcat SnpFunctionCode.bcp.gz
3   cds-synon         synonymous change (ex. rs248, GAG->GAA, both produce amino acid Glu)
6   intron            intron (ex. rs249)
8   cds-reference     contig reference
9   synonymy unknown  coding: synonymy unknown
13  nearGene-3        within 500 bp 3' of a gene (ex. rs3916027, within 500 bp of the UTR start for NM_000237.2)
15  nearGene-5        within 2 kb 5' of a gene (ex. rs7641128, within 2 kb of the UTR start for NM_000237.2)
41  STOP-GAIN         changes to a STOP codon (ex. rs328, TCA->TGA, Ser to terminator)
42  missense          alters a codon to make an altered amino acid in the protein product (ex. rs300, ACT->GCT, Thr->Ala)
43  STOP-LOSS         changes a STOP codon to another, non-stop codon
44  frameshift        indel SNP causing a frameshift
45  cds-indel         indel SNP with length a multiple of 3 bp, not causing a frameshift
53  UTR-3             3' untranslated region (ex. rs3289)
55  UTR-5             5' untranslated region (ex. rs1800590)
73  splice-3          3' acceptor dinucleotide: the last two bases of an intron, usually AG (ex. rs193227)
75  splice-5          5' donor dinucleotide: the first two bases of an intron, usually GU (ex. rs8424)
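
If you want names instead of numbers in the output, the table above can
be loaded into a hash with a few lines (a sketch; decompress
SnpFunctionCode.bcp.gz first):

#!/usr/bin/perl -w
# Sketch: load fxn_class code -> short name from SnpFunctionCode.bcp
# (tab-separated: code, abbreviation, description, ...).
use strict;

my $file = 'SnpFunctionCode.bcp';
open(my $FC, "<", $file) or die "Could not open $file: $!";
my %fxn_name;
while (my $line = <$FC>) {
    chomp($line);
    my ($code, $abbrev) = split(/\t/, $line);
    $fxn_name{$code} = $abbrev if defined $abbrev;
}
close($FC);
print "fxn_class 42 is '$fxn_name{42}'\n";   # "missense"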

Cory Tobin

May 25, 2011, 9:01:23 PM5/25/11
to diy...@googlegroups.com
> (23andme doesn't even tell you exactly which Illumina chip they use.)

As of late 2010 they are using the Illumina HumanOmniExpress Plus
which is basically just a custom version of the HumanOmniExpress.
http://www.illumina.com/products/human_omni_express.ilmn
23andme added a bunch of extra SNPs to it - 20,000 or so iirc.


-cory

Phil

May 25, 2011, 9:17:21 PM5/25/11
to DIYbio
I see that the other numbers are listed in dbSNP; but only 423586
of them are listed in the dbSNP file
organisms/human_9606/database/organism_data/b132_SNPContigLocusId_37_1.bcp
that I used.

Tom Randall

May 26, 2011, 11:51:36 AM5/26/11
to DIYbio
Another very useful collection of annotations to attach to your raw
SNP data can be found here:

http://www.genome.gov/gwastudies/

This is the GWAS catalog; download the tab-delimited file. It is updated
daily and contains all the SNPs that have been identified as being
associated with some phenotype in a published study, with p-values,
references, the type of chip the study was performed on, etc. That is
much more information than 23andme gives you via their website. I don't
know the overlap between these data and Promethease, but today's catalog
has over 5000 results.
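
Here is a sketch of attaching those annotations to a 23andme raw file.
The catalog's column names change between releases, so treat 'SNPs' and
'Disease/Trait' below as assumptions to check against the header of the
file you actually download:

#!/usr/bin/perl -w
# Sketch: annotate a 23andme raw file with the GWAS catalog's tab-delimited download.
use strict;

my $catalog_file = 'gwascatalog.txt';                  # the tab-delimited catalog download
my $genome_file  = 'genome_Your_Name_Full_Date.txt';   # your 23andme raw file

# Index the catalog by rs number
open(my $CAT, "<", $catalog_file) or die "Could not open $catalog_file: $!";
chomp(my $header = <$CAT>);
my @cols = split(/\t/, $header);
my %col;
@col{@cols} = (0 .. $#cols);
die "Unexpected catalog header" unless defined $col{'SNPs'} && defined $col{'Disease/Trait'};
my %trait;
while (my $line = <$CAT>) {
    chomp($line);
    my @f = split(/\t/, $line);
    my ($snps, $phenotype) = @f[ $col{'SNPs'}, $col{'Disease/Trait'} ];
    next unless defined $snps && defined $phenotype;
    foreach my $rs ($snps =~ m/(rs\d+)/g) {            # a field may list several SNPs
        $trait{$rs} .= "$phenotype; ";
    }
}
close($CAT);

# Walk the 23andme file and print any hits
open(my $SNPS, "<", $genome_file) or die "Could not open $genome_file: $!";
while (my $line = <$SNPS>) {
    chomp($line);
    next if $line =~ m/^#/;
    my ($snpid, $chromo, $pos, $genotype) = split(/\t/, $line);
    print "$snpid\t$genotype\t$trait{$snpid}\n" if $snpid && $trait{$snpid};
}
close($SNPS);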

Phil

May 26, 2011, 12:00:40 PM5/26/11
to DIYbio
On May 25, 9:01 pm, Cory Tobin <cory.to...@gmail.com> wrote:
> > (23andme doesn't even tell you exactly which Illumina chip they use.)
>
> As of late 2010 they are using the Illumina HumanOmniExpress Plus
> which is basically just a custom version of the HumanOmniExpress.http://www.illumina.com/products/human_omni_express.ilmn
> 23andme added a bunch of extra SNPs to it - 20,000 or so iirc.
>
> -cory

The JHU core SNP center offers genotyping with the OmniExpress.
Just $330/sample - if you have at least 1800 samples.
http://snpcenter.grcf.jhmi.edu/pricing.html
You have to pay $384/sample if you only have 46 samples.

How on earth did 23andme get the price down to around $100?
They must be losing money.

Phil

May 26, 2011, 12:57:07 PM5/26/11
to DIYbio
On May 26, 11:51 am, Tom Randall <tarand...@gmail.com> wrote:
> Another very useful collection of annotations to attach to your raw
> SNP data can be found here:
>
> http://www.genome.gov/gwastudies/

Thanks!

I tried Promethease last night.
I couldn't get the $2 deluxe version - it has too much functionality
built into the application,
including the payment. The program tries to open a website that does
not exist.
The Promethease web page gives you no way to contact the authors,
so I can't inform them of this.

The free version's results were awesome - SNPedia contains much more
usable information than my program extracted. However, the abundance
of results made many things uninterpretable. For instance, I saw about
7 different SNPs influencing premature baldness, some increasing the
likelihood, some decreasing it, each with a different certainty. Same
story for Alzheimer's disease.

At a minimum, the program should collect all the SNPs
that influence a disease, assume independence, and combine all
their certainties and likelihood odds to construct an outcome
probability distribution. As it is, it's uninterpretable for
conditions for which multiple conflicting SNPs turn up.
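
A sketch of that naive combination: start from a prior probability for
the condition and multiply in one odds ratio per SNP, treating them as
independent (the numbers below are made up for illustration, and the
per-SNP certainty scores are ignored):

#!/usr/bin/perl -w
# Sketch: combine per-SNP odds ratios under an independence assumption.
use strict;

sub combine_odds {
    my ($prior_prob, @odds_ratios) = @_;
    my $odds = $prior_prob / (1 - $prior_prob);   # prior probability as odds
    $odds *= $_ foreach @odds_ratios;             # independence assumption
    return $odds / (1 + $odds);                   # back to a probability
}

# e.g. a 10% population risk and three reported SNPs with odds ratios 1.5, 0.8, and 2.0
printf "combined risk estimate: %.1f%%\n", 100 * combine_odds(0.10, 1.5, 0.8, 2.0);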

I'm corresponding with NCBI about the missing rs#'s.
They don't know why they aren't in that file.

Gavin Scott

May 26, 2011, 1:03:45 PM5/26/11
to diy...@googlegroups.com
On Thursday, May 26, 2011 9:00:40 AM UTC-7, Phil wrote:
How on earth did 23andme get the price down to around $100?
They must be losing money.

Well, the $100 is only with a perpetual $10/month recurring charge for access to the data and their ongoing updates. It's $300 if you don't want to pay the monthly fee. Also it's my impression that they get a price-break on processing because they get handled as a "lowest priority" by their sequencing service vendor who only runs their samples when they have nothing better to do, hence the sometimes interminable waits before your 23andme data is available.

So I suspect they pay around, or even a bit less than, the $300 that they charge. To the degree that people are willing to sign up for the $10/month version, they may make a killing off the service as over, say, five years that amounts to $700 per customer.

G.

Phil

May 26, 2011, 6:22:13 PM5/26/11
to DIYbio
On May 25, 9:17 pm, Phil <philgo...@gmail.com> wrote:
> I see that the other numbers are listed in dbSNP; but only 423586
> of them are listed in the dbSNP file
> organisms/human_9606/database/organism_data/b132_SNPContigLocusId_37_1.bcp
> that I used.

Lon Phan of NCBI wrote: "b132_SNPContigLocusId_37_1.bcp only list the
rs that are mapped to a gene region."
The "missing" SNPs are not in genes.

Patrik

May 27, 2011, 2:55:13 AM5/27/11
to DIYbio
I've been looking at some of the "no call" entries in my 23andMe data.
Supposedly, the quality thresholds for calls are chosen such that you
should get around 1% no-calls on the autosomal chromosomes, which
works out about perfectly for me. However, I did notice that there are
significantly more runs of 2 or more adjacent no-calls than would be
expected if they were randomly distributed.

Assuming the locations of the SNPs on the array are randomized (which
they should be), one hypothesis for these blocks of no-calls could be
deletions. If so, we could assume that the phenotype for a deletion is
likely to be at least as deleterious as the worst phenotype of the
SNPs it covers.
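
A quick way to check this on your own raw file (a sketch; it assumes
23andMe writes no-calls as '--' genotypes):

#!/usr/bin/perl -w
# Sketch: tally runs of adjacent no-calls ('--') in a 23andme raw file.
use strict;

my $file = 'genome_Your_Name_Full_Date.txt';
open(my $SNPS, "<", $file) or die "Could not open $file: $!";
my %run_lengths;
my $run = 0;
my $prev_chromo = '';
while (my $line = <$SNPS>) {
    chomp($line);
    next if $line =~ m/^#/ || $line eq '';
    my ($snpid, $chromo, $pos, $genotype) = split(/\t/, $line);
    next unless defined $genotype;
    if ($chromo ne $prev_chromo) {            # don't let a run span chromosomes
        $run_lengths{$run}++ if $run;
        $run = 0;
    }
    $prev_chromo = $chromo;
    if ($genotype eq '--') {
        $run++;
    }
    else {
        $run_lengths{$run}++ if $run;         # record the run that just ended
        $run = 0;
    }
}
$run_lengths{$run}++ if $run;
close($SNPS);
foreach my $len (sort { $a <=> $b } keys %run_lengths) {
    print "runs of $len adjacent no-calls: $run_lengths{$len}\n";
}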

Things to check: Do these stretches of no-calls occur in other
publicly available 23andMe data sets, and if so, are they correlated
with ancestry? Where do these putative deletions map with respect to
gene boundaries, and is there any bias towards intergenic regions
(which should make the phenotypes less severe, and the deletion
hypothesis seem more likely)?

Phil

May 29, 2011, 1:57:20 PM5/29/11
to DIYbio
That is a very interesting observation.
I'm not aware of any publicly available 23andme datasets, however.

Half the SNPs come from intergenic regions, so there could be
lots of deletions/insertions in these areas.
My first question would be whether these no-call SNPs
are listed in the dbSNP file
organisms/human_9606/database/organism_data/b132_SNPContigLocusId_37_1.bcp.
If not, they are intergenic.

Daniel MacArthur

Jun 3, 2011, 9:39:27 AM6/3/11
to DIYbio
You can find a list of publicly available 23andMe (and other DTC
genomic) data-sets here:

https://spreadsheets.google.com/ccc?key=0AlJjaPERgZNadDh1TXowUnR4TFN3MmhEWl9EbXZQd0E

For instance, my own data and 11 of my colleagues at the Genomes
Unzipped project can be downloaded (or accessed by API or browser)
here:

http://www.genomesunzipped.org/data

On the question of why 23andMe sells their kits so cheap - bear in
mind that their business model is not selling kits. Rather, the
company is building a large database of individuals who contribute
their genotype data as well as trait information (through surveys) for
research, which it can then sell (in a responsibly anonymised fashion)
to third parties, e.g. pharmaceutical companies. But this type of
research requires big numbers, which means selling kits cheap - even
at a substantial loss.

Tom Randall

Jun 4, 2011, 12:57:05 PM6/4/11
to DIYbio
More genomes/genotypes, some 23andme, some not, are here:

http://www.snpedia.com/index.php/Genomes


Daniel MacArthur

Jun 5, 2011, 7:50:12 PM6/5/11
to DIYbio
The spreadsheet link was bad, I think. Here's another try:

http://bit.ly/publicgenetic


Jason Bobe

Jun 8, 2011, 2:46:49 AM6/8/11
to diy...@googlegroups.com


On Sunday, June 5, 2011 4:50:12 PM UTC-7, Daniel MacArthur wrote:
The spreadsheet link was bad, I think. Here's another try:

http://bit.ly/publicgenetic

A few more places for accessing public human genomes include:

(1) Amazon Public Datasets: 

(BTW, anyone here have experience using amazon for performing computation on genomic datasets?)

(2) CGI has 69 human genomes here:

(3) More public genomes at PersonalGenomes.org (my employer):
 
Jason

Patrik

Jun 9, 2011, 2:01:03 AM6/9/11
to DIYbio
Hiya Jason - how common is it to see short deletions in whole genome
data? I haven't really dealt with the whole genomes yet, but that
seems like a plausible explanation for stretches of no-calls in SNP
data.

amisme

Aug 10, 2015, 9:31:26 AM8/10/15
to DIYbio, biocu...@googlegroups.com
I fear I am far from experienced enough to glean much from this fantastic post. I am trying to assess information from my family's genotypes, and I am drawing a blank when comparing MIM numbers to 23andme's data. I'm afraid building a SQLite DB is a bit beyond me. Is this the only way to correlate these? I am definitely missing a step between looking up something in the OMIM database (say colorectal cancer: http://omim.org/entry/114500) and then trying to determine from my 23andMe data whether that genotype is present. Is there a list that correlates rsids with gene/locus MIM numbers? Or positions? I do not understand why I am not finding similar numbers/terminology in each list.