SNAPP results scale question

698 views
Skip to first unread message

RasmusHeller

unread,
Sep 9, 2013, 8:02:39 AM9/9/13
to beast...@googlegroups.com
Hi,

I am having some problems interpreting the outcome of a SNAPP analysis on 500 (randomly selected) SNPs from three species. Depending on which priors I use for coalescence rate and lambda I get wildly different posterior estimates for theta and the TreeHeight parameter. I even get different relative magnitudes of the thetas (assuming that the theta indexes are consistent between runs). This puzzles me and makes me wonder whether I am misunderstanding some aspect of the parameterization in SNAPP and how the different parameters scale. I attach a overview of the mean parameter estimates from Tracer. The forward and backwards mutation rates also differ among the runs, but not nearly enough to explain the discrepancy in theta or tree height. Tree height should be in units of mutations, right? And theta is just 4*N*mu. I know I should probably run longer chains, but so far I am exploring and the results don't encourage me to start a week-long run. I also am not sure I understand the purpose and appropriate settings for u and v. It says in the SNAPP manual that "The theta values are unaff ected by this rescaling. If the true mutation rate is known, then the values returned by the program can be converted into e ffective population sizes using N = 4*N*mu".

Does anyone have comments to how I should go about the analysis? I want to know the branching order of the three species, their divergence times and relative population sizes. The two latter ideally in real units, but initially just their mutation rate scaled values.

Thanks,

Rasmus
Tracer_overview.pdf

Paolo Gratton

unread,
Mar 3, 2014, 12:39:03 PM3/3/14
to beast...@googlegroups.com
Hi Rasmus,

I am having similar experiences to yours. I did not play with lambda and 'coalescent rate' priors, but my results are certainly sensitive to priors on theta (alha and beta with the gamma prior).

I am also struggling to undersand what 'the true mutation rate' should be. I have received answers from Remco and David under the thread 'SNAPP mutation rates', you can read them in this user group.

Did you solve the problem somehow or managed to get results you could trust? Did you figure out what a reasonable 'true mutation rate' should be?

Best

Paolo

David Bryant

unread,
Mar 4, 2014, 2:39:25 AM3/4/14
to beast...@googlegroups.com
Hi Rasmus,

I can see why you are confused.

Firstly, I think there might be a bug creeping in since SNAPP shouldn't change the overall rate  (at least I'm not sure that it would work properly if they are). Remco is visiting here early next week and we'll look at that, however I'd suggest fixing u and v at their estimated values in the meantime.

However I think that your real problem is that when you have polymorphic characters only, it can be extremely difficult to identify both population size  and branch length / tree height. Increasing the branch length has almost exactly the same effect on the likelihoods as decreasing the population size. Under the infinite sites model, the two actually become formally non-identifiable. (This is a flip side of S being a sufficient statistic for theta, if you're interested). This non-identifiability is not a shortcoming of SNAPP, but a mathematical inconvenience of the data/model.

Note that in your case the values have gone all over the place, but the relative values (e.g. theta / treeheight) are far more conserved. If you plot one against the other in tracer you should see a strong correlation.

What are the options?
(1) Good prior information on treeheight or theta should fix the scale problem, or
(2) I would guess that full sequence analyses (e.g. use variable and non-variable sites, and make sure to check the box 'use non-polymorphic characters) would help, as would a full gene sequence approach (e.g. Beast / *-Beast) as the gene sequences have more rate information.
(3) Compute 'coalescent units' which are in units of 2N generations.
For the conversion (and this is at the end of a long day, so please check this, sorry)
    theta = 4 * N * (mutations per site per generation)
    branch length = (mutations per site)
so the number of lots of 2N generations is
   0.5*branchlength/theta

These units seem to be standard among the gene-tree, species tree crowd.

cheers,

David.

paolo gratton

unread,
Mar 4, 2014, 4:16:46 AM3/4/14
to beast-users
Hi David,

I am now making my mid more clear about the limits with analysing polymorphic data only.
I think you could consider being more explicit about this in the SNAPP website / manual, since it is very important.

I see that you are suggesting to use *BEAST, which I already tried (although my impression is that you cannot practically use loci with just 1 polymorphic site in *BEAST since estimating the gene tree for each of them is almost impossible, so that the runs will never converge).
But you also mention using 'full sequence'. Of course I could include non polymorphic sites, bu then the nucleotides won't be unlinked any more...

Well, maybe I should look for a different job

Thank you a lot

Paolo


--
You received this message because you are subscribed to the Google Groups "beast-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to beast-users...@googlegroups.com.
To post to this group, send email to beast...@googlegroups.com.
Visit this group at http://groups.google.com/group/beast-users.
For more options, visit https://groups.google.com/groups/opt_out.

paolo gratton

unread,
Mar 4, 2014, 5:10:58 AM3/4/14
to beast-users
I am sorry.. one more thing..

 if "Increasing the branch length has almost exactly the same effect on the likelihoods as decreasing the population size" (which seems logical to me), why SNAPP returns ancestral populations sizes that are NOT the same as the prior, and have narrow CIs? as well as pointed estimates of branch lengths? I just don't understand. Either there is information in the data about both branch lengths and pop. sizes or I would expect that the posterior would be the same as the prior (in my analyses, this is actually the case for most SHORT terminal branches, but not for longer branches and for internal branches).

Where am I wrong?

Thank you all

Paolo

David Bryant

unread,
Mar 4, 2014, 8:22:16 PM3/4/14
to beast...@googlegroups.com
HI Paolo,


On Tuesday, March 4, 2014 10:16:46 PM UTC+13, PGrat wrote:
Hi David,

I am now making my mid more clear about the limits with analysing polymorphic data only.
I think you could consider being more explicit about this in the SNAPP website / manual, since it is very important.


One problem is that I'm not completely sure of the extent in the limitations. We raised the question of identifiability in our paper, and showed that branch lengths and thetas are technically identifiable. However, that is going to depend a great deal on your context, and where coalescences occurred in the species tree you are trying to recover. For example, if there were no coalescences along a particular branch there will be no way, using any analysis of the data) of recovering information about theta. I agree some discussion of this would be apt in the manual - I'd been hesitating until I properly understood all the issues myself!


 
I see that you are suggesting to use *BEAST, which I already tried (although my impression is that you cannot practically use loci with just 1 polymorphic site in *BEAST since estimating the gene tree for each of them is almost impossible, so that the runs will never converge).
But you also mention using 'full sequence'. Of course I could include non polymorphic sites, bu then the nucleotides won't be unlinked any more...


Reality is inconvenient at times. As a preliminary, you might try an 'empirical Bayes' approach to setting up your priors on theta (or, more responsibly, put aside some data to determine a priors). For each population, can you perhaps use the proportion of segregated sites to estimate theta, then fit the alpha and beta parameters of the 'coalescent rate' distribution from your estimates? If there are any other theory folk out there, does that sound reasonable? A decent prior on the thetas would then help a lot with the (practical) non-identifiability.

 
Well, maybe I should look for a different job


you me both.

-D.


 

David Bryant

unread,
Mar 4, 2014, 8:25:22 PM3/4/14
to beast...@googlegroups.com
Three possibilities

(1) You've struck it lucky and there is enough information in your data to discriminate these.
(2) Your priors contribute more than you expect (try multiplying alpha by 10 to see if all those tight peaks move as well)
(3) Convergence problems (check ESS in tracer)

-David,.
Reply all
Reply to author
Forward
0 new messages