SNAPP mutation rates


Paolo Gratton

unread,
Feb 24, 2014, 2:46:02 PM2/24/14
to beast...@googlegroups.com
Hi all,

I probably have some more naive questions about SNAPP, but I could not figure them out from the manual or Bryant et al.'s 2012 paper.

1) In Bryant et al.'s paper I read that, as a prior implemented in SNAPP, "the stationary allele proportions are fixed at the observed frequencies of red and green alleles in the data", and, in the manual ("A rough guide to SNAPP", October 11, 2012), I read that, "given an estimate of pi_0 and pi_1", u and v can be obtained by u = 1/(2*pi_0) and v = 1/(2*pi_1).

Does it mean that I can get a sensible prior by calculating the frequencies of 0 and 1 in the whole data matrix and using them as estimates of pi_0 and pi_1 to calculate u and v? In this case, mu = 2(u*v)/(u+v) = 1, right?

Moreover, am I wrong if I think that, for SNPs data, there is not much reason to think that u and v are not the same (while with AFLP data losing a site would be probably easier than acquiring it)? 

2) assuming that mu = 1, would it make sense to scale the species trees obtained by RAD-sequencing SNPs data by an average genomic point mutation rate?

I.e., if, for example, the root height is estimated at 0.02 and the mutation rate is 10^-8, the absolute time would be 0.02/10^-8 = 2,000,000.
Or would the actual rate be affected by the fact that I am only using variable positions? Or anything else I did not think about.

I hope this can be a useful topic.. and not too stupid.

Thank you a lot!!

Paolo

David Bryant

unread,
Feb 24, 2014, 3:44:49 PM2/24/14
to beast...@googlegroups.com
Hi Paolo,


On Tuesday, February 25, 2014 8:46:02 AM UTC+13, Paolo Gratton wrote:
Hi all,

I probably have some more naive questions about SNAPP, but I could not figure them out from the manual or Bryant et al.'s 2012 paper.

1) In Bryant et al.'s paper I read that, as a prior implemented in SNAPP, "the stationary allele proportions are fixed at the observed frequencies of red and green alleles in the data", and, in the manual ("A rough guide to SNAPP", October 11, 2012), I read that, "given an estimate of pi_0 and pi_1", u and v can be obtained by u = 1/(2*pi_0) and v = 1/(2*pi_1).


That's correct: u = 1/(2 pi_0) and v = 1/(2 pi_1).
 
Does it mean that I can get a sensible prior by calculating the frequencies of 0 and 1 in the whole data matrix and using them as estimates of pi_0 and pi_1 to calculate u and v? In this case, mu = 2(u*v)/(u+v) = 1, right?

Yes, although there is a bit of a statement of faith here: in our experience the posteriors of pi_0 and pi_1 were strongly peaked around the observed values. This will not be the case in small data sets, and it could well be that in the future someone publishes work demonstrating that this assumption is unwarranted for this or that type of data.

If you substitute the values for u and v into the formula you will, as you point out, get mu = 1. Put another way, we are measuring branch lengths in units of expected mutations per site. Often in population genetics branch lengths are measured in coalescent units, but this makes no sense for SNAPP, where different populations have different effective population sizes.
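The calculation above can be sketched in a few lines of Python (the toy 0/1 matrix is hypothetical; the point is that with u and v chosen this way, mu comes out as 1 regardless of the observed frequencies):

```python
# Estimate pi_0 and pi_1 from a 0/1 SNP matrix, then derive u and v.
def rates_from_matrix(matrix):
    flat = [x for row in matrix for x in row]
    pi1 = sum(flat) / len(flat)   # observed frequency of allele 1
    pi0 = 1.0 - pi1               # observed frequency of allele 0
    u = 1.0 / (2.0 * pi0)         # mutation rate 0 -> 1
    v = 1.0 / (2.0 * pi1)         # mutation rate 1 -> 0
    return u, v

u, v = rates_from_matrix([[0, 0, 1, 0], [1, 0, 0, 1]])
mu = 2 * u * v / (u + v)          # overall rate; equals 1 by construction
print(round(mu, 10))              # -> 1.0
```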


Moreover, am I wrong if I think that, for SNPs data, there is not much reason to think that u and v are not the same (while with AFLP data losing a site would be probably easier than acquiring it)? 


Yes, I think that this is a reasonable model assumption. I would also think that drift plays a bigger role in the observed allele frequencies than mutation (at least that might justify a fairly approximate model for mutation itself).


2) assuming that mu = 1, would it make sense to scale the species trees obtained by RAD-sequencing SNPs data by an average genomic point mutation rate?

I.e., if, for example, the root height is estimated at 0.02 and the mutation rate is 10^-8, the absolute time would be 0.02/10^-8 = 2,000,000.
Or would the actual rate be affected by the fact that I am only using variable positions? Or anything else I did not think about.


Tricky question. First make sure that you have unchecked the "non-polymorphic" checkbox in the mutation model, telling SNAPP that your data set will only
include the variable sites. Once that is done, the scale of the branch lengths will be in terms of expected mutation per site, meaning the rate averaged over
variable and constant sites. Hence your calculation would be correct.
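Spelled out with the example numbers from the question (both the root height and the rate are hypothetical):

```python
root_height = 0.02   # estimated tree height, in expected mutations per site
rate = 1e-8          # assumed mutation rate per site per generation
time_in_generations = root_height / rate
print(f"{time_in_generations:,.0f} generations")  # -> 2,000,000 generations
```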

If, instead, you are including all of the sites in the alignment (constant as well), you will need to check the "non-polymorphic" checkbox to get the scale right.

 
I hope this can be a useful topic.. and not too stupid.

Not at all. I have to stop and think every time I come to a rates question.... 

-David.



Remco Bouckaert

unread,
Feb 24, 2014, 6:29:43 PM2/24/14
to beast...@googlegroups.com
Hi Paolo,

On Tuesday, February 25, 2014 8:46:02 AM UTC+13, Paolo Gratton wrote:
1) In Bryant et al.'s paper I read that, as a prior implemented in SNAPP, "the stationary allele proportions are fixed at the observed frequencies of red and green alleles in the data", and, in the manual ("A rough guide to SNAPP", October 11, 2012), I read that, "given an estimate of pi_0 and pi_1", u and v can be obtained by u = 1/(2*pi_0) and v = 1/(2*pi_1).

Does it mean that I can get a sensible prior by calculating the frequencies of 0 and 1 in the whole data matrix and using them as estimates of pi_0 and pi_1 to calculate u and v? In this case, mu = 2(u*v)/(u+v) = 1, right?

That is correct.
 

Moreover, am I wrong if I think that, for SNPs data, there is not much reason to think that u and v are not the same (while with AFLP data losing a site would be probably easier than acquiring it)? 

You could easily replace 0s by 1s and vice versa for a site, and the SNAPP model still works but this will have an effect on u and v. So from that perspective, there is no reason to assume u and v must be the same.
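The effect Remco describes is easy to see: flipping the 0/1 coding of a site swaps pi_0 and pi_1, and therefore swaps u and v (a sketch with a hypothetical frequency):

```python
def uv(pi0):
    """u and v implied by u = 1/(2*pi_0), v = 1/(2*pi_1)."""
    return 1.0 / (2.0 * pi0), 1.0 / (2.0 * (1.0 - pi0))

u, v = uv(0.75)      # original coding: pi_0 = 0.75
u_f, v_f = uv(0.25)  # same site with 0s and 1s swapped
print(round(u, 2), round(v, 2))      # -> 0.67 2.0
print(round(u_f, 2), round(v_f, 2))  # -> 2.0 0.67: the two rates exchange roles
```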


 
2) assuming that mu = 1, would it make sense to scale the species trees obtained by RAD-sequencing SNPs data by an average genomic point mutation rate?

I.e., if, for example, the root height is estimated at 0.02 and the mutation rate is 10^-8, the absolute time would be 0.02/10^-8 = 2,000,000.
Or would the actual rate be affected by the fact that I am only using variable positions? Or anything else I did not think about.


In SNAPP, branch lengths and population sizes are highly correlated. If you have a strong prior on the population sizes, or your data is not very informative as far as population sizes is concerned, the prior may dominate population sizes and thus have an effect on time estimates. In principle the tree that SNAPP produces is in units of substitutions per site, so you could replace  mutation rate by an absolute rate to get time estimates. As far as I know, no study has been done to see how accurate such estimates are though.

Cheers,

Remco

 

paolo gratton

unread,
Feb 24, 2014, 8:11:38 PM2/24/14
to beast-users
Thank you again, Remco

you are incredibly helpful.

However, I am sorry, but I am afraid I am missing some points..

- BACK AND FORTH RATES

You say that, with SNPs, one could "easily replace 0s by 1s and vice versa for a site", and that would change the values of u and v. 

But I am confused by the idea of 'averaging' pi_0 and pi_1 (and thus u and v) across loci. For example, imagine I have a data matrix with just two loci. One is coded so that I have pi_0 = 0.75 and pi_1 = 0.25 (but that could be the opposite, since it is completely arbitrary). For this locus, the appropriate rates would be u = 1/(2*0.75) = 0.67 and v = 1/(2*0.25) = 2. For the other locus I could have pi_0 = 0.15 and pi_1 = 0.85, and the appropriate rates would be u = 3.33 and v = 0.59. I don't see why SNAPP should be told that the rates are the 'averaged' u = 1.11 and v = 0.91, which is not the 'right' value for either locus. What am I missing?
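Paolo's two-locus example in code (a sketch; the 'pooled' rates are what a single genome-wide u and v, computed from the averaged frequencies, would imply):

```python
def uv(pi0):
    """Per-locus rates implied by u = 1/(2*pi_0), v = 1/(2*pi_1)."""
    pi1 = 1.0 - pi0
    return 1.0 / (2.0 * pi0), 1.0 / (2.0 * pi1)

locus1 = uv(0.75)   # -> (0.67, 2.0)
locus2 = uv(0.15)   # -> (3.33, 0.59)

# Rates implied by averaging pi_0 over the two loci
pooled = uv((0.75 + 0.15) / 2)
print([round(r, 2) for r in pooled])  # -> [1.11, 0.91], 'right' for neither locus
```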

- ABSOLUTE RATES

In fact I was just trying to understand how my results can be so 'clean' with regard to thetas and times. As I see it, disentangling time and pop size is always puzzling.

Nonetheless, the trees I am getting (after ESS for all parameters are > 100) have relatively narrow CIs on divergence times.

However, if I try to apply a 'typical' genomic rate (ca. 10^-8 subst./(site*generation)), I get very large estimates for both Ne and times. Actually, I have done a *Beast run on the same samples, using sequences from 8 nuclear loci and a couple of mtDNA genes, and (applying 'standard' mtDNA rates) the estimated times should be at least one order of magnitude lower.

I have also noticed that some estimates of divergence time (especially the root height) seem sensitive to the number of samples I use (removing a few samples I get overlapping but clearly different estimates for the root height).

Well.. I don't want to annoy too much.. If it may be interesting, let me know.

Thanks a lot

Paolo


--
You received this message because you are subscribed to the Google Groups "beast-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to beast-users...@googlegroups.com.
To post to this group, send email to beast...@googlegroups.com.
Visit this group at http://groups.google.com/group/beast-users.
For more options, visit https://groups.google.com/groups/opt_out.

paolo gratton

unread,
Feb 25, 2014, 3:51:18 AM2/25/14
to beast-users
Hi

Strangely, I received Remco's answer before David's. And I read David's just now. That's why I did not thank David yet :)..

It seems like the two of you have slightly different ideas.. David said that u = v may be reasonable for SNPs, while Remco disagreed. I have already asked one more question to Remco. I would add that I observed that setting u = v = 1 instead of two different rates (u = 3, v = 0.6) in my data resulted in estimates of divergence that are quite different (by a factor of ca. 2).

Please, David, could you also look at my reply to Remco about applying an absolute rate? One more question: what do you mean by saying that, by unchecking the "non-polymorphic" box, the rate is "averaged over variable and constant sites"? How does SNAPP know how many constant sites there are in my data, if I use a matrix of variable sites only?

Thanks all

Paolo


Remco Bouckaert

unread,
Feb 25, 2014, 7:58:35 PM2/25/14
to beast...@googlegroups.com
Hi Paolo,


On Tuesday, February 25, 2014 2:11:38 PM UTC+13, PGrat wrote:
- BACK AND FORTH RATES

You say that, with SNPs, one could "easily replace 0s by 1s and vice versa for a site", and that would change the values of u and v. 

But I am confused by the idea of 'averaging' pi_0 and pi_1 (and thus u and v) across loci. For example, imagine I have a data matrix with just two loci. One is coded so that I have pi_0 = 0.75 and pi_1 = 0.25 (but that could be the opposite, since it is completely arbitrary). For this locus, the appropriate rates would be u = 1/(2*0.75) = 0.67 and v = 1/(2*0.25) = 2. For the other locus I could have pi_0 = 0.15 and pi_1 = 0.85, and the appropriate rates would be u = 3.33 and v = 0.59. I don't see why SNAPP should be told that the rates are the 'averaged' u = 1.11 and v = 0.91, which is not the 'right' value for either locus. What am I missing?


In principle, one can have a mutation rate for each site, but that would mean adding one parameter per site. Since we already have enough trouble trying to estimate all population sizes and other parameters, it would be very hard to estimate each of these rates. So, instead, we have the parameters u and v that we use for each site. Note that u and v are linked in SNAPP, so there is effectively only a single parameter to estimate. I suppose when there is a sufficient number of lineages, u and v could be empirically determined for each site before running the analysis -- just like you did in your example. This is not how SNAPP is implemented, but it might be a good idea to try how that works out. It has always bugged me that the rather arbitrary choice of how a SNP is encoded can impact u and v.


 
- ABSOLUTE RATES

In fact I was just trying to understand how my results can be so 'clean' with regard to thetas and times. As I see it, disentangling time and pop size is always puzzling.

Nonetheless, the trees I am getting (after ESS for all parameters are > 100) have relatively narrow CIs on divergence times.

However, if I try to apply a 'typical' genomic rate (ca. 10^-8 subst./(site*generation)), I get very large estimates for both Ne and times. Actually, I have done a *Beast run on the same samples, using sequences from 8 nuclear loci and a couple of mtDNA genes, and (applying 'standard' mtDNA rates) the estimated times should be at least one order of magnitude lower.

I have also noticed that some estimates of divergence time (especially the root height) seem sensitive to the number of samples I use (removing a few samples I get overlapping but clearly different estimates for the root height).


Did you try different settings for the prior on thetas? It is always a good idea to see how sensitive your results are to the prior, and it may explain the extreme time differences you observe.

Cheers,

Remco

 

paolo gratton

unread,
Feb 25, 2014, 8:55:22 PM2/25/14
to beast-users
Hi Remco,

so nice of you to keep discussing..

BACK AND FORTH RATES

thanks for the discussion. In my data, it looks like different choices do not make a huge difference, but I will definitely stick to the way you proposed in the manual.

ABSOLUTE RATES and PRIORS ON THETA

Today I have tried to change the prior on theta. First of all, I noticed that, in my previous run, several of the estimated thetas (mostly those on short terminal branches) basically reflected the prior (I had used the 'default' prior: alpha = 11.75, beta = 109.73).

I have then set a gamma distribution with the same mean as the default but with a higher standard deviation (alpha = 2, beta = 18.68). On a log scale, this prior is in fact *much* wider.
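The two priors can be compared numerically (a sketch assuming, as these values suggest, that SNAPP's gamma prior uses shape alpha and rate beta, so mean = alpha/beta and sd = sqrt(alpha)/beta):

```python
import math

def gamma_stats(alpha, beta):
    # shape/rate parameterisation: mean = alpha/beta, sd = sqrt(alpha)/beta
    return alpha / beta, math.sqrt(alpha) / beta

default_mean, default_sd = gamma_stats(11.75, 109.73)  # the 'default' prior
wide_mean, wide_sd = gamma_stats(2.0, 18.68)           # same mean, wider
print(round(default_mean, 4), round(default_sd, 4))  # -> 0.1071 0.0312
print(round(wide_mean, 4), round(wide_sd, 4))        # -> 0.1071 0.0757
```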

I have run a short chain, until it appeared to converge, and I noticed that:
1) the variation in theta among branches was, understandably, wider
2) the terminal branches also had much more varied thetas
3) the total height of the tree was NOT much different (much less than an order of magnitude)
4) the tree was quite different, not so much in topology as in relative branch lengths; in particular, one divergence was now much shorter but with very tiny thetas along the connecting branches. It was as if SNAPP 'solved' the issue in an alternative way.. What puzzles me is that there was not more uncertainty, but rather a decisive switch towards a different 'solution'. As if the two priors (though one should basically be just wider than the other) channeled SNAPP in different directions. Last note: the 'new' solution looks *very* weird, based on geography, basic measures of genetic differentiation at RAD loci, and mtDNA divergence.

In summary, it looks like results ARE sensitive to the priors. Although not so much to explain the huge divergence times I would get by applying a 'typical' genomic rate.

I am still a bit confused. As I understand it, the priors on theta may well influence branch lengths: if I set a narrow prior at a larger value, all branches with little information will converge to a large size, thus making the times larger, and vice versa, won't they? However, if the mutation rate were 'correct', I would expect that extremely long times should correspond to ridiculously high effective sizes (Ne = theta/(4*mu)) and vice versa.. well, I am not sure.. maybe it is too late here in Europe..

Thank you again

Paolo





David Bryant

unread,
Feb 25, 2014, 8:59:54 PM2/25/14
to beast...@googlegroups.com

Hi Paolo,

It is a pity that Remco and I are no longer working in the same city....


It seems like the two of you have slightly different ideas.. David said that u = v may be reasonable for SNPs, while Remco disagreed. I have already asked one more question to Remco. I would add that I observed that setting u = v = 1 instead of two different rates (u = 3, v = 0.6) in my data resulted in estimates of divergence that are quite different (by a factor of ca. 2).


Remco and I are either both right or both wrong, depending on interpretation. You have to realise that you are working with (fairly) crude models here, and your choices should be informed by how you obtained your data:
(1) If you coded alleles 1 or 0 randomly at each site, then you will have roughly equal proportions of 1's and 0's and u = v should work fine (it is only an approximation).
(2) If you have a specific coding (e.g. 0 = R, 1 = Y) then you should estimate u and v from the data (either beforehand, in BEAUti, or as part of the MCMC).
(3) If you are working with AFLP data you should code 1 = presence, 0 = absence and estimate u and v from the data.
(4) If you have an outgroup or some other way of deciding ancestral vs derived, then it makes sense to choose 1 for derived and 0 for ancestral [but this is me going by intuition only].
(5) If you really didn't think too much about assigning 0 and 1, but probably didn't assign it randomly, then you should probably re-assign them randomly, as I'm not sure what kind of weird biases you might be creating.
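Option (5), re-assigning the 0/1 labels randomly, could be done per site along these lines (a sketch; the matrix layout with rows as samples and columns as sites is a hypothetical):

```python
import random

def recode_randomly(matrix, seed=1):
    """Flip the 0/1 labels of each column (site) with probability 1/2."""
    rng = random.Random(seed)
    n_sites = len(matrix[0])
    flip = [rng.random() < 0.5 for _ in range(n_sites)]
    return [[1 - x if flip[j] else x for j, x in enumerate(row)]
            for row in matrix]

recoded = recode_randomly([[0, 0, 1], [1, 0, 1]])
```

After recoding, the pooled frequencies of 0 and 1 should be close to 0.5, so u = v (approximately 1) becomes a reasonable setting.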
 
Please, David, could you also look at my reply to Remco about applying an absolute rate? One more question: what do you mean by saying that, by unchecking the "non-polymorphic" box, the rate is "averaged over variable and constant sites"? How does SNAPP know how many constant sites there are in my data, if I use a matrix of variable sites only?


It doesn't. Very important: if you have a matrix of variable sites make sure the non-polymorphic box is checked.


-David.

paolo gratton

unread,
Feb 25, 2014, 9:28:05 PM2/25/14
to beast-users
Hi David, and thanks

1) As you very insightfully guessed :), my coding is 'careless' rather than random. I wrote a script that translates nucleotides into 0s and 1s, and it actually assigns 0 to the first allele it finds, and 1 to the second. Of course, it is much more likely that the first allele is the most common allele in the data, so that I have many more 0s than 1s. Would you advise me to code them more randomly, or is my matrix fine, as long as I set u and v accordingly?

2) Sorry, but I am confused, and I need to be sure:

yesterday you wrote 'make sure that you have UNCHECKED the "non-polymorphic" checkbox in the mutation model, telling SNAPP that your data set will only include the variable sites'.

Today you write ' if you have a matrix of variable sites make sure the non-polymorphic box is CHECKED'.

I have a matrix of POLYMORPHIC sites. What would be the correct choice so that I could use a genomic rate to scale my tree?

CHECK or UNCHECK the "non-polymorphic" box?


Thanks you VERY much!!

Paolo



 



paolo gratton

unread,
Feb 26, 2014, 5:40:49 AM2/26/14
to beast-users
Sorry.. I wrote something wrong.

My coding of nucleotides as 1 or 0 is not based on which is found first, but on alphabetical order. Therefore, if I have many more 0s than 1s (0.72 vs. 0.28), it must be that I actually have more A and C than G and T.. that's puzzling..

Paolo

paolo gratton

unread,
Feb 28, 2014, 12:24:57 PM2/28/14
to beast-users
Hello, David and Remco

sorry for bothering again.. I hope it is almost over :).


1) I must correct myself again. My SNPs were coded according to the most common allele, so it is normal that I had more 0s than 1s. Now I have re-coded them randomly, and I have almost exactly as many 1s as 0s. Therefore u and v can now be sensibly set to 1.

2) I still need some clarification about the 'non-polymorphic' checkbox.
As I understand it from the manual, it should simply tell SNAPP to ignore invariant sites (when UNCHECKED) or use them to compute the likelihood (if CHECKED).

Since my matrix contains ONLY variable sites, I would expect it to have no effect. But David told me it is important to check it in order to have "correct" mutation rates.

To get a feel for it, I have tried to run two otherwise identical input files with 'non-polymorphic' checked or unchecked. I actually got two very similar trees, differing only slightly in some pop. sizes, but with one (CHECKED) having branch lengths roughly double those of the other (UNCHECKED).

3) As to the use of genomic mutation rates, somebody suggested to me that I should consider that I only used variable sites selected from 85 bp fragments where most sites were invariant, and that I should therefore multiply the mutation rate by 85. It would actually make my tree much closer to what I expected, but I am not fully convinced..

What do you think?

Thank you a lot!!

Paolo

David Bryant

unread,
Mar 4, 2014, 1:55:26 AM3/4/14
to beast...@googlegroups.com

2) I still need some clarification about the 'non-polymorphic' checkbox.
As I understand it from the manual, it should simply tell SNAPP to ignore invariant sites (when UNCHECKED) or use them to compute the likelihood (if CHECKED).

Since my matrix contains ONLY variable sites, I would expect it to have no effect. But David told me it is important to check it in order to have "correct" mutation rates.

To get a feel for it, I have tried to run two otherwise identical input files with 'non-polymorphic' checked or unchecked. I actually got two very similar trees, differing only slightly in some pop. sizes, but with one (CHECKED) having branch lengths roughly double those of the other (UNCHECKED).

If the box is UNCHECKED, SNAPP will expect that everything except polymorphic sites has been removed from the data, and correct the likelihood (and estimates) for that fact. This is what you want.

If the box is CHECKED, SNAPP will assume that the non-polymorphic sites have not been removed. You've said that your data set contains only polymorphic sites. To explain this, SNAPP is going to suggest species trees with long branches, as these are more likely to give data sets without non-polymorphic sites. This explains your observation.


3) As to the use of genomic mutation rates, somebody suggested me that I should consider that I only used variable sites selected from 85bp fragments where most sites were invariant. Therefore I should multiply the mutation rate for 85.. Actually it would make my tree much more as expected, but I am not fully convinced..

What do you think?


You are using only variable sites: make sure the non-polymorphic checkbox is UNCHECKED.
The rate is expected mutations per site, so you wouldn't need to multiply the rate by 85.


-David.

paolo gratton

unread,
Mar 4, 2014, 4:05:51 AM3/4/14
to beast-users
Thank you a lot, David

there has been a little confusion about this. Now it is clear.

I have a curiosity, though. Did you ever try to convert a tree obtained in SNAPP from SNP data to demographic units using a known mutation rate?

Thanks again

Paolo

David Bryant

unread,
Mar 4, 2014, 8:09:31 PM3/4/14
to beast...@googlegroups.com
No, I didn't, but I'd be interested in finding out how it went!

Careful, though. I'm not sure how reliable the absolute dates/branch lengths are, at least not without some prior information on appropriate thetas and/or fossil dates.

-David.

paolo gratton

unread,
Mar 5, 2014, 4:03:50 PM3/5/14
to beast-users
Thank you very much, David, for your continued support!

I am slowly understanding the actual effect of priors on theta.

It looks like, in my analyses, I have little information about theta for the shortest branches, but some more for the longer (usually internal) branches. This would make sense if SNAPP's ability to estimate theta depends on the number of coalescent events (and mutations?) occurring in each branch.

At any rate, I think that a reasonable way to get absolute estimates in demographic units can be something like:

1) pick a sensible mutation rate (e.g. from the literature.. actually it seems to me that rates of spontaneous mutation estimated, e.g., from genomic sequencing of mutation accumulation lines, are quite consistent across different species [ca. 10^-9 to 10^-8 per site per generation])

2) set a prior on theta corresponding to reasonable Ne for the populations of interest

3) run SNAPP

4) divide branch lengths by the mutation rate and thetas by 4 times the mutation rate to get absolute estimates

How does it sound?
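Step 4 of this recipe is just two divisions (a sketch; the branch length, theta, and rate values are hypothetical):

```python
mu = 1e-8   # assumed mutation rate per site per generation

def to_absolute(branch_length, theta, mu):
    """Convert SNAPP's substitutions-per-site units to generations and Ne."""
    generations = branch_length / mu
    ne = theta / (4.0 * mu)
    return generations, ne

gens, ne = to_absolute(branch_length=0.0015, theta=0.001, mu=mu)
print(f"{gens:,.0f} generations, Ne = {ne:,.0f}")  # -> 150,000 generations, Ne = 25,000
```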

I wonder if setting a prior on lambda is an effective way to calibrate the tree. According to the manual, the value of lambda sets the prior expected root height. However, in my trials, the 'default' lambda corresponds to an expected root height of ca. 250, while, after setting alpha and beta so that the gamma mean is 0.001 (Ne = 25,000 individuals) and the gamma standard deviation is 0.0003 (7,500 individuals), I get a posterior estimate for the root height of ca. 0.0015 (150,000 generations), which is reasonable to me but VERY far from the expected value given lambda.

So, it looks like the prior on theta largely dominates the MCMC.

I am now trying to play with lambda to get a feeling of its influence.

Thanks again

jason....@gmail.com

unread,
Nov 10, 2015, 10:51:03 AM11/10/15
to beast-users
Hi Paolo,

I would love to hear about what you found.

Thanks,
Jason