Is there any description of the algorithm of JointSNVMix?


Sangwoo Kim

Jan 14, 2012, 7:02:33 PM
to JointSNVMix User Group, jhl...@ucsd.edu
Hi,

My name is Sangwoo Kim from UCSD, CSE department.
Thank you for developing such a useful tool for biological/medical
analysis.

I tried to find some methodological description of the JointSNVMix
algorithm.
I currently understand that it depends on the SNVMix (Goya et al)
Could you give any brief description of how they are related?
Preferably in mathematics?
We are planning to analyze some tumor/normal pairs but would like to
know if the tool is applicable to the condition of our samples.

Thank you.

Regards,

Sangwoo

aroth

Jan 15, 2012, 2:14:39 PM
to JointSNVMix User Group
Hi Sangwoo,
JointSNVMix is conceptually very similar to SNVMix.

SNVMix can be thought of as a mixture model with three hidden states
{AA, AB, BB}, the three diploid genotypes. For SNVMix1 each component
in the mixture uses a binomial density, while SNVMix2 uses a more
complicated density that depends on base qualities.
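
To make that concrete, here is a minimal sketch of the SNVMix1 idea in
Python. The per-genotype reference-allele probabilities (mu) and
genotype priors (pi) are made-up illustrative values, not the ones
shipped with the software:

```python
import math

# Illustrative parameters (not the software's defaults): MU[g] is the
# probability a read shows the reference base given genotype g, PI[g]
# is the prior probability of genotype g.
MU = {"AA": 0.99, "AB": 0.50, "BB": 0.01}
PI = {"AA": 0.97, "AB": 0.02, "BB": 0.01}

def binom_pmf(k, n, p):
    """Probability of k successes in n trials under Binomial(n, p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def genotype_posterior(a, d):
    """Posterior over {AA, AB, BB} at a site with a reference bases out
    of d reads: p(g | a, d) is proportional to PI[g] * Binom(a; d, MU[g])."""
    joint = {g: PI[g] * binom_pmf(a, d, MU[g]) for g in MU}
    total = sum(joint.values())
    return {g: joint[g] / total for g in joint}

post = genotype_posterior(a=12, d=25)  # about half the reads are reference
```

With roughly half the reads showing the reference base, the AB
component dominates the posterior despite the strong AA prior.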

JointSNVMix extends this framework to two samples. To do so we replace
the three hidden states of SNVMix with nine hidden states {AA_AA,
AA_AB, ..., BB_BB}. Here AA_AB would indicate the normal sample has
the genotype AA and the tumour has the genotype AB. Thus JointSNVMix
is just a nine-state mixture model. The component densities for
JointSNVMix1 are the products of two binomial densities: intuitively,
we take the SNVMix1 densities for the normal and tumour samples and
multiply them together. The same idea applies for JointSNVMix2, where
we use the SNVMix2 densities.
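
The nine-state construction can be sketched the same way; the mu
values below are again illustrative only:

```python
import math
from itertools import product

MU = {"AA": 0.99, "AB": 0.50, "BB": 0.01}  # illustrative values only

def binom_pmf(k, n, p):
    """Probability of k successes in n trials under Binomial(n, p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def joint_densities(a_n, d_n, a_t, d_t):
    """Component density for each of the nine joint states gN_gT: the
    product of one binomial per sample, as in JointSNVMix1. Inputs are
    reference counts (a) and depths (d) for normal (n) and tumour (t)."""
    return {
        f"{gn}_{gt}": binom_pmf(a_n, d_n, MU[gn]) * binom_pmf(a_t, d_t, MU[gt])
        for gn, gt in product(MU, MU)
    }

# Normal looks homozygous reference, tumour roughly heterozygous, so
# AA_AB (a somatic SNV) gets the largest component density.
dens = joint_densities(a_n=30, d_n=30, a_t=14, d_t=28)
```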

We have submitted a paper which should be out soon. If that is delayed
I will try to make a file with a full explanation of the model
available on the website.

Cheers,
Andy

Henning Stehr

Jan 15, 2012, 3:06:56 PM
to andrew...@gmail.com, JointSNVMix User Group
Hi Andy,

thanks for this nice explanation. I have an additional question: Why
is it that SNVMix requires genotype information as training data while
JointSNVMix can be trained on the BAM files directly. Or did I miss
anything there?

Thanks,
Henning

aroth

Jan 15, 2012, 4:01:51 PM
to JointSNVMix User Group
Hi Henning,
To be honest I am not 100% sure how the original SNVMix software was
implemented. The fact that known genotypes are passed in would imply
that it uses supervised training; however, the paper states that EM is
used, which would be unnecessary if that were true. I suggest
e-mailing Dr. Shah about the details of that paper.

JointSNVMix does not need the genotypes, since it uses the classic EM
training strategy for fitting mixture models. This means that it is
essentially an unsupervised clustering algorithm. In some sense
JointSNVMix acts like K-means or a Gaussian mixture model, except that
we don't assume spherical clusters, we know in advance how many
clusters there are, and each cluster has a biological interpretation.
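
For intuition, the EM loop for a plain single-sample, two-component
binomial mixture can be sketched as below. This is generic textbook
EM on toy data, not the JointSNVMix implementation:

```python
import math

def binom_pmf(k, n, p):
    """Probability of k successes in n trials under Binomial(n, p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def em_binomial_mixture(data, mu, pi, iters=50):
    """Classic EM for a finite binomial mixture. `data` is a list of
    (a, d) count pairs; `mu` and `pi` are initial per-component success
    probabilities and mixing weights (lists of equal length)."""
    K = len(mu)
    for _ in range(iters):
        # E-step: responsibilities r[k] = p(component k | site) per site.
        resp = []
        for a, d in data:
            w = [pi[k] * binom_pmf(a, d, mu[k]) for k in range(K)]
            z = sum(w)
            resp.append([x / z for x in w])
        # M-step: re-estimate mixing weights and success probabilities.
        n_k = [sum(r[k] for r in resp) for k in range(K)]
        pi = [n / len(data) for n in n_k]
        mu = [
            sum(r[k] * a for r, (a, d) in zip(resp, data))
            / sum(r[k] * d for r, (a, d) in zip(resp, data))
            for k in range(K)
        ]
    return mu, pi

# Two well-separated clusters: ~100% reference and ~50% reference.
data = [(20, 20), (19, 20), (10, 20), (11, 20)]
mu, pi = em_binomial_mixture(data, mu=[0.9, 0.6], pi=[0.5, 0.5])
```

On this toy data the fitted mu values land near 0.975 and 0.525, the
empirical reference-allele fractions of the two clusters.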

I hope that helps.

Cheers,
Andy

Sangwoo Kim

Jan 16, 2012, 12:28:41 PM
to JointSNVMix User Group
Hi Andy,
Thank you for your kind explanation.

Can I ask a few more questions?

1. You explained that the component densities for JointSNVMix are the
products of two binomial densities. If we run SNVMix for a normal and
a tumor sample separately and simply concatenate each genotype call as
{genotype_normal + genotype_tumor} (e.g. AA_AA, AA_AB, or BB_BB),
does it make a big difference compared to running JointSNVMix on the
two samples together?

2. Can either of the normal/tumor samples be heterogeneous? I mean,
for example, if the tumor sample is highly heterogeneous or
contaminated, the tumor genotype can be a mixture of two different
populations (e.g. 50% AA and 50% AB), generating a somewhat more
complicated distribution. Is that within JointSNVMix's range?

Thanks

Sangwoo

aroth

Jan 16, 2012, 1:10:14 PM
to JointSNVMix User Group
Hi Sangwoo,

1. Short answer is that JointSNVMix is much more specific and a little
less sensitive than the approach you describe.

Longer answer. The main difference between JointSNVMix and running
SNVMix as described is that the latter ignores the correlation between
samples. The result is that two independent SNVMix runs combined
post-hoc tend to predict a large number of false positives, though
they also predict more true positives. In practice, the large number
of false positives from running SNVMix was what motivated the
development of JointSNVMix.

Running SNVMix and combining the results post-hoc is included in the
JointSNVMix software. You can use the commands snv_mix_one/snv_mix_two
in place of joint_snv_mix_one/joint_snv_mix_two when training and
classifying. Be aware that these commands use different priors and
parameters files, but these are included in the program's config/
folder. They are called indep_params.cfg and indep_priors.cfg.

2. Short answer: the software does not explicitly model heterogeneity.
However, training the model should let the parameters adapt to this
situation. Every sample we have used the software on has some level of
normal contamination, so I would say the software can handle this
situation provided training is run.

Long answer. In principle what you are describing would lead to a
landscape of allelic frequencies (#ref_bases/depth) in the tumour with
more than three modes. To understand this, remember that if it were
pure tumour we would see three peaks for (AA, AB, BB), ignoring copy
number. If we are in the situation you describe, then we would instead
have peaks for (AA_AA, AA_AB, AA_BB, ..., BB_BB). This might not be a
huge issue, however. Consider a position which is AA_AB. For the
program to work, all that is required is that the position is
significantly more likely to match the AB cluster in the tumour than
either the AA or BB clusters.

I do consider this a limitation along with not explicitly handling
copy number, and would like to address it in the future.

One simple idea I have is to expand the number of parameters used in
the binomial densities. At present, there are only 6 parameters for
the binomial densities, 3 for the normal and 3 for the tumour. This is
accomplished by sharing, say, the tumour parameter for the genotype AB
across the classes AA_AB, AB_AB, BB_AB. A simple way to handle
heterogeneity in the tumour might be to change this setup so that
there are 9 parameters for the tumour.
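
The sharing scheme above can be made explicit in a few lines. The mu
values are illustrative, and the per-class variant at the end is just
the sketched extension, not something the released software does:

```python
from itertools import product

GENOTYPES = ["AA", "AB", "BB"]

# Current scheme: one parameter per genotype per sample (3 + 3 = 6).
# The tumour parameter for genotype AB is shared by every joint class
# whose tumour genotype is AB: AA_AB, AB_AB and BB_AB.
mu_N = {"AA": 0.99, "AB": 0.50, "BB": 0.01}  # illustrative values
mu_T = {"AA": 0.99, "AB": 0.50, "BB": 0.01}

shared = {
    f"{gn}_{gt}": (mu_N[gn], mu_T[gt])
    for gn, gt in product(GENOTYPES, GENOTYPES)
}

# Sketched extension for heterogeneity: give the tumour an independent
# parameter per joint class (9 tumour parameters), initialised from the
# shared values, so that e.g. AA_AB and AB_AB could drift apart during
# training.
mu_T_per_class = {cls: mu for cls, (_, mu) in shared.items()}
```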

Hopefully that helps.

Cheers,
Andy

Sangwoo Kim

Jan 17, 2012, 5:23:53 PM
to JointSNVMix User Group
Andy

I loved your answer. It helped a lot.

Thank you

Sangwoo

Andrew Heiberg

Feb 1, 2012, 7:01:30 PM
to JointSNVMix User Group
Hi Andy,

My name is Andrew, also from UCSD and also working with Sangwoo on
this project. As he mentioned, our tumor samples are contaminated; we
are guessing anywhere from 10-50%. However, we want to know for
sure. My understanding is the training step computes the
distributions that are the most likely to have generated the data.
Would it be possible, using these distributions, to estimate our
contamination level?

Also, at the end of your last post, you mentioned there are 3
parameters for each binomial density. Perhaps you were simplifying
things for our sake, but why aren't they the traditional 'n' and 'p'?

Thanks very much,

Andrew

aroth

Feb 1, 2012, 11:06:11 PM
to JointSNVMix User Group
Hi Andrew,
JointSNVMix is probably not the right tool for trying to estimate the
normal contamination. The estimated parameters do not explicitly
account for cellularity and will be confounded with other factors such
as platform/alignment bias and aneuploidy. If you are working with
WGSS data, my lab has recently developed a tool, APOLLOH
<http://compbio.bccrc.ca/software/apolloh/>, which can estimate normal
contamination in addition to predicting regions of LOH. The software
should also work with exon capture data, though my impression from the
software's author is that it will be less accurate.

As to the second question: what I meant to say is that there are three
separate binomial densities per genotype, each with their own 'p',
which I call 'mu' in the parameters file. The 'n' is set from the
depth of reads at the position. For clarification, the original SNVMix
paper by Goya et al. should help. In addition, JointSNVMix was just
published in Bioinformatics; the early access version can be found
here
<http://bioinformatics.oxfordjournals.org/content/early/2012/01/27/bioinformatics.bts053.abstract>.
If you are particularly interested in the details, the supplemental
material contains a full description of the model and derivation of
all the EM updates.

I hope that helps.

Andy

Andrew Heiberg

Feb 2, 2012, 3:35:35 PM
to JointSNVMix User Group
Thanks Andy, this was a big help