SNP Data Combination Project Proposal

Jeff A.

Apr 11, 2008, 10:11:12 AM

to OpenBiomind

Hello. It was recommended to me that I seek suggestions pertaining to
my project in this group. This seems like a very helpful resource, so
I thought I would do just that. Attached is my project proposal. Any
ideas or criticism are welcome!

Thanks,

Jeff

================================================================
ABSTRACT:
=========
Single-nucleotide polymorphisms have been shown to have a strong
relationship to diseases of various types. A polymorphism can produce
mis-structured proteins, have an effect on gene regulation and
control, or have various other effects. Thus, SNP microarrays (which
detect the presence of SNPs at various loci in the genome) have proven
useful in determining genetic associations with certain diseases. This
is now being applied on an individual level as organizations such as
23andMe begin to offer personalized analyses of the likelihood of an
individual having various diseases and conditions.
The foundation on which these systems are built is the consolidation
of data into a usable form. There are a handful of
companies in the market for SNP chip production, each with a variety
of chips they offer. Thus, in order to provide a reasonable predictive
engine for diseases, one must find a way to combine all of this data.
I propose to develop a novel system in which SNP data can be combined
with statistical and empirical evidence. After identifying the specific
SNPs within an individual, we can compare them to empirical
research, which will associate each SNP with a set of
possible diseases and conditions. On a higher level, we will consider
the haplotypes represented on each SNP array as well. This will provide
the "networks" of SNPs which are commonly inherited together (linkage
disequilibrium). In doing this, we will be able to bridge the gaps
between chips and manufacturers as well as provide a more authentic
validation at an individual level.

===============
PROPOSAL:
===============
To the mentors of OpenCog/OpenBiomind,

Combining SNP data of various forms raises multiple problems. First,
not all chip manufacturers poll the same loci for SNPs. The more the
chips differ, the more problems arise when attempting to synthesize
data sets. Even within the same manufacturer, there may be quite a bit
of variation (Affymetrix 10K vs. Affymetrix 500K). These problems only
get worse when combining chips from different manufacturers.
The naïve approach would be to only keep those loci common across all
the chips in consideration. This would throw away a large portion of
useful information. The system would also continue to lose loci as
more data gets inserted into the system; clearly this is not a viable
approach.
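
A minimal sketch of why the intersection approach degrades, assuming hypothetical chip names and made-up SNP identifiers (real chips query thousands to hundreds of thousands of loci):

```python
# Hypothetical locus sets for three chips; names and IDs are illustrative only.
chips = {
    "chip_a": {"rs1", "rs2", "rs3", "rs4", "rs5"},
    "chip_b": {"rs2", "rs3", "rs4", "rs6"},
    "chip_c": {"rs3", "rs4", "rs7", "rs8"},
}


def common_loci(chip_loci):
    """Return the loci present on every chip (the naive intersection)."""
    sets = list(chip_loci.values())
    return set.intersection(*sets) if sets else set()


def all_loci(chip_loci):
    """Return every locus queried by at least one chip."""
    return set().union(*chip_loci.values())


shared = common_loci(chips)
total = all_loci(chips)
print(f"kept {len(shared)} of {len(total)} loci: {sorted(shared)}")
```

Each additional chip can only shrink the shared set, which is the loss of information described above.
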
Thus, statistical analysis must be combined with current databases
which have done much of the work for us. The solution is not a trivial
one. Individual SNPs should be studied and considered in order to find
their pertinence to a given condition. However, diseases tend to
involve SNPs at several loci, and nearby SNPs are often inherited
together (a non-random association called linkage disequilibrium, or
LD). By considering the LD of SNP
data, we can not only improve the accuracy of the disease
classification, but can also bridge gaps across various chips and chip
manufacturers. Thus, the solution comes through a combination of the
concept of haplotypes (and the associated HapMap project) as well as
through referencing collections of data pertaining to the study of
individual SNPs (NCBI's dbSNP project).
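
The proposal does not name a specific LD measure. As an assumption, the sketch below uses two standard ones, D and r², computed for a pair of biallelic loci from phased haplotypes coded as 0/1. The function name and the input data are hypothetical.

```python
def ld_stats(haplotypes):
    """Compute D and r^2 for two biallelic loci.

    haplotypes: non-empty list of (a, b) pairs, each allele coded 0 or 1,
    one pair per phased chromosome.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n           # frequency of allele 1 at locus A
    p_b = sum(b for _, b in haplotypes) / n           # frequency of allele 1 at locus B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                              # deviation from independence
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else float("nan")  # undefined if a locus is monomorphic
    return d, r2


# Two loci that always travel together: complete LD.
linked = [(1, 1)] * 4 + [(0, 0)] * 4
# Two loci whose alleles combine independently: no LD.
independent = [(1, 1), (1, 0), (0, 1), (0, 0)]

print(ld_stats(linked))
print(ld_stats(independent))
```

An r² near 1 means one SNP is a good proxy for the other, which is what makes bridging between chips plausible.
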
Rather than normalizing the data or parsing it all universally, we
should preserve the unique traits of each set in their combination.
The first step, then, is to consider each individual SNP by polling
the dbSNP database (openly available for download). The dbSNP database
will provide information on individual SNPs in the genome. The
database itself is very extensive and contains plenty of information
about most known SNP associations.
The other "class" of data comes from linkage disequilibrium, which
will be used in two ways. The first is to trace the associations
between the SNPs of an individual (where the SNP data allow). For
instance, if a patient has 47 of the 50 SNPs associated with a
haplotype, it can be inferred that this patient has a high probability
of carrying that haplotype (a set of statistically associated SNPs).
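
The 47-of-50 example can be sketched as a simple match fraction. This is only an illustration: the fraction is not itself a calibrated probability, and the function name, the allele coding, and the SNP identifiers are assumptions.

```python
def haplotype_match(observed, haplotype):
    """Return (fraction matching, number of typed SNPs) for one haplotype.

    observed:  dict of SNP id -> allele called in the individual
               (missing keys or None are treated as untyped).
    haplotype: dict of SNP id -> allele that defines the haplotype.
    """
    typed = [snp for snp in haplotype if observed.get(snp) is not None]
    if not typed:
        return None, 0  # nothing to compare against
    matches = sum(1 for snp in typed if observed[snp] == haplotype[snp])
    return matches / len(typed), len(typed)


# Hypothetical haplotype of 50 SNPs, all defined by allele "A".
haplotype = {f"snp_{i}": "A" for i in range(50)}
# The individual matches at 47 loci and differs at 3.
observed = dict(haplotype)
for i in range(3):
    observed[f"snp_{i}"] = "G"

fraction, typed = haplotype_match(observed, haplotype)
print(f"{fraction:.2f} of {typed} typed SNPs match")
```

Returning the number of typed SNPs alongside the fraction matters for sparse chips, where a high fraction over very few loci should carry less weight.
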
The second use of LD data is in bridging the gaps between
manufacturers; this is a primary emphasis of the project. By
identifying haplotypes from LD patterns (using the HapMap project), we
will be able to infer relationships between individuals. Thus, even if
a certain SNP data set leaves us with only a small percentage of
pertinent SNPs (Affymetrix 10K, for instance), this data set can still
be statistically useful by observing the "trends" (SNPs within
certain haplotypes) across the data.
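
A minimal sketch of the bridging idea: two chips that share no loci can still be compared at the level of haplotype blocks. The block assignments, chip contents, and names below are hypothetical; in practice the mapping would have to be derived from HapMap data.

```python
# Hypothetical assignment of SNPs to haplotype blocks.
snp_to_block = {
    "rs1": "block_1", "rs2": "block_1",
    "rs3": "block_2", "rs4": "block_2",
    "rs5": "block_3",
}


def blocks_covered(chip_snps, mapping):
    """Return the haplotype blocks touched by at least one SNP on the chip."""
    return {mapping[snp] for snp in chip_snps if snp in mapping}


chip_small = {"rs1", "rs3"}          # sparse chip
chip_large = {"rs2", "rs4", "rs5"}   # denser chip, different loci

shared_snps = chip_small & chip_large
shared_blocks = (blocks_covered(chip_small, snp_to_block)
                 & blocks_covered(chip_large, snp_to_block))

print("shared SNPs:", sorted(shared_snps))
print("shared blocks:", sorted(shared_blocks))
```

Here the SNP-level intersection is empty, yet both chips report on block_1 and block_2.
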

I consider two output formats for the synthesis of the data:
1. Haplotype-driven form. This is a higher form which will provide
insight into the trends among the SNPs. This way, we can quickly find
the patterns present in the data without considering the individual
SNPs. I envision that this will have a much more universal application
in the prediction of diseases.
2. Individual SNP-level form. This is a much more specific form of
output which would concern itself with the statistical likelihood of a
SNP being present in an individual. The three pieces of interest in
this form would be the SNP identifier/location (either in a dbSNP
format, or an absolute reference, etc.), SNP probability (the
probability of the SNP being present in the individual) and empirical
verification (Present, Absent, NoCall, N/A).
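
One possible shape for a record in the SNP-level form, using the three fields listed above. The class and field names are assumptions, and the identifier in the example is a placeholder rather than a real dbSNP entry.

```python
from dataclasses import dataclass
from enum import Enum


class Call(Enum):
    """Empirical verification status of a SNP on the chip."""
    PRESENT = "Present"
    ABSENT = "Absent"
    NO_CALL = "NoCall"
    NOT_APPLICABLE = "N/A"   # locus not queried by this chip


@dataclass(frozen=True)
class SnpRecord:
    snp_id: str          # dbSNP-style identifier or an absolute reference
    probability: float   # estimated probability that the SNP is present
    call: Call           # what the chip actually reported, if anything

    def __post_init__(self):
        # Reject values that cannot be probabilities.
        if not 0.0 <= self.probability <= 1.0:
            raise ValueError(f"probability out of range: {self.probability}")


# A SNP inferred from LD on a chip that does not query it directly.
record = SnpRecord("rs_example_1", 0.87, Call.NOT_APPLICABLE)
print(record)
```

Keeping the inferred probability separate from the empirical call lets downstream code tell a measured SNP apart from one inferred through LD.
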
The final aspect of the project is in the combination of this data.
This is, arguably, the most difficult part. Luckily, there are
references in the databases which will help us. The dbSNP database,
for instance, provides the clinical association (if any) for each SNP
and the validation for that relationship. Thus, from each individual
SNP, we can generate an overall "map" of a patient's likelihood of
disease based on individual SNPs.
The other source of clinical associations is in the HapMap project.
Ideally, the more recent chips will include so-called "tag SNPs" (SNPs
that are each representative of an entire haplotype). HapMap currently estimates
that there are between 300,000 and 600,000 tag SNPs - a much more
reasonable number of loci to query on a SNP chip. Even if the chip
doesn't contain many of these tags, the SNPs found may be associated
with a certain haplotype. The HapMap project handles disease
associations as well.
The combination of these two sets of "votes" (one from HapMap, one
from dbSNP) is another pertinent issue which will have to be resolved.
I feel it would be naïve to commit to a single method at this stage.
I think it will take much testing and statistical analysis to find the
best method of assigning disease associations. However, I have a few
options in mind:
1. Weighted Majority Vote - determine an accuracy (weight) for each
method (and any alternative method added in the future) and weight the
vote of each method accordingly. This could eventually produce a "top
n" table of disease potentials, even with a percentage likelihood
(with error).
2. Mean and Median rank - I could average the ranks within both
methods and then combine them accordingly.
3. Shrinkage parameter - much more complex, but involves the
calculation of 'g,' the shrinkage parameter from each statistic.
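
The first two options can be sketched as follows. This is only one way to read them: the method names, weights, disease labels, and scores are all made up for illustration, and the shrinkage approach is left out because the proposal does not specify it.

```python
def weighted_vote(scores_by_method, weights):
    """Weighted average of per-method disease scores.

    A disease that a method does not score contributes 0 for that method.
    """
    total_weight = sum(weights[m] for m in scores_by_method)
    combined = {}
    for method, scores in scores_by_method.items():
        for disease, score in scores.items():
            combined[disease] = combined.get(disease, 0.0) + weights[method] * score
    return {d: s / total_weight for d, s in combined.items()}


def mean_rank(scores_by_method):
    """Average each disease's rank (1 = highest score) across methods."""
    ranks = {}
    for scores in scores_by_method.values():
        ordered = sorted(scores, key=scores.get, reverse=True)
        for position, disease in enumerate(ordered, start=1):
            ranks.setdefault(disease, []).append(position)
    return {d: sum(r) / len(r) for d, r in ranks.items()}


def top_n(combined, n):
    """Return the n highest-scoring (disease, score) pairs."""
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Hypothetical scores from the two sources and hypothetical accuracy weights.
scores = {
    "dbsnp":  {"disease_a": 0.8, "disease_b": 0.3, "disease_c": 0.1},
    "hapmap": {"disease_a": 0.4, "disease_b": 0.6, "disease_c": 0.2},
}
weights = {"dbsnp": 0.7, "hapmap": 0.3}

combined = weighted_vote(scores, weights)
print(top_n(combined, 2))
print(mean_rank(scores))
```

In this toy data the two sources disagree on the top disease, so the mean ranks tie while the weighted vote follows the more heavily weighted source.
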

Various others will be considered. I would like to spend a fair amount
of time on this aspect of the project. I feel that my strong base in
Biostatistics as well as my resources within the field will be very
helpful in developing a sophisticated, statistically viable system in
which these ranking methods can be combined. Again, it will largely be
a matter of testing to see which combination method works best.

One further consideration would be the use of Clark's algorithm to
infer my own statistically verified haplotypes from the data, as
opposed to taking them from an empirical reference. If time permits, I
would also enjoy adding the capability to consider Loss of
Heterozygosity (LOH). This would provide additional accuracy in the
analysis of SNP data, primarily with regard to cancers.

NOTE: I've also provided a timeline and personal background on my
website.
