Priors and runtimes


Jane

May 21, 2018, 2:29:27 PM
to beast-users
Dear All, 
I'm setting up an analysis broadly following the BFD* tutorial, and I have two questions:
1. When it comes to prior specification, there is very little known about the species I work on, and as such it is difficult to make sensible estimates of tree height and population sizes (for the lambda and theta values), or choose a sensible distribution. My plan therefore is to allow SNAPP to estimate these values in the mcmc chain (with broad upper and lower bounds). I then plan to validate these by doing an additional run using modified priors, as suggested in the Bryant et al., 2012 paper. Does this sound like an appropriate way forward?
2. My preliminary runs are taking prohibitively long. For example, I set one up 4 days ago to run for 100,000 MCMC steps, and the likelihood.log file still simply says "Sample    Likelihood", i.e. (I think) it still hasn't completed even one of those 100,000 steps! My data consist of 115 individuals and 1762 SNPs. Reading around on this group, the general advice seems to be to experiment with threading, which I've done without much success. Failing that, I've just set up a run with 34 individuals, in the hope that this will give more feasible runtimes. Is there a minimum number of individuals per species needed for BFD? And does anyone have advice for speeding up runtimes beyond the above?
Thanks very much,
Jane

Remco Bouckaert

May 21, 2018, 3:27:36 PM
to beast...@googlegroups.com
Hi Jane,

Is it possible that you set a value for preBurnin in the XML? If preBurnin is set, there will be no output until all pre-burn-in samples have been taken by the MCMC chain. It is good practice to start with a smaller number of individuals to see how SNAPP behaves, then keep adding sequences until it runs too slowly to work with.
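(For reference, preBurnin is an attribute on the MCMC run element in the BEAST2 XML. A minimal sketch — the id, chain length, and pre-burn-in value below are illustrative, not taken from Jane's file:

<!-- Sketch of a SNAPP/BEAST2 run element; ids and values are illustrative.
     With preBurnin="10000", nothing is written to the logs until 10,000
     pre-burn-in samples have been taken. -->
<run id="mcmc" spec="MCMC" chainLength="100000" preBurnin="10000">
    ...
</run>

Setting preBurnin back to 0, or removing the attribute, makes the logs start filling from the first sample.)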

Unfortunately, using fewer individuals means there will be less signal for estimating population sizes: if you only have a single individual per species there will be on average one coalescent event per branch, so population sizes will typically be dominated by the prior. So whether the strategy of refining the prior works depends a bit on how many sequences can be handled by SNAPP. Perhaps the additional information on priors in the BFD* tutorial (https://github.com/BEAST2-Dev/beast-docs/releases/download/v1.0/BFD-tutorial-2017.zip) will help.

Cheers,

Remco



Jane

May 22, 2018, 2:41:44 PM
to beast-users
Hi Remco, 
Thanks for the quick and helpful response. I did set a pre-burnin, so that explains the empty likelihood.log file so far. 
Given what you've said about the trade-off between using fewer individuals for a feasible runtime and needing more individuals per species so that population-size estimates are not dominated by the prior, I wonder if it would be possible to split the dataset? My whole dataset of 115 individuals contains about 10 species (give or take, hence this analysis!), and maximum likelihood trees inferred using RAxML show that these individuals fall into about 5 distinct clades, with long branch lengths separating the clades from each other. Would it be appropriate to run BFD individually for each of these clades, testing a 1-species vs 2- or 3-species scenario within each? Each BEAST run would then have far fewer individuals, so it should run much faster, while still having on average about 10 samples per species, hopefully solving both of the above issues. I could then use the results of these independent comparisons to establish the number of species present in each clade, and therefore in the entire group of 115 individuals. Of course, only marginal likelihoods from the same clade would be used to calculate Bayes factors. Is there a reason why this isn't a sensible solution?
Thanks, and I hope that's clear. 
Jane

Remco Bouckaert

May 22, 2018, 4:25:24 PM
to beast...@googlegroups.com
Hi Jane,

What you write certainly sounds like a good strategy to reduce the number of lineages per analysis, given the only thing you want to do is species delimitation. As you already mentioned, you can only compare marginal likelihoods from analyses that use the same sets of sequences. If it is computationally feasible, you may want to add individuals from the closest neighbour of the clade you want to delimit, just to make sure there is no interaction from higher up in the tree, though given the long branch lengths you mention, it should not make much of a difference.
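(A quick sketch of the comparison step, in case it helps a future reader: once two path-sampling runs on the same clade and the same sequences have produced marginal log-likelihood estimates, the 2 ln(BF) statistic reported in the BFD* tutorial is just a subtraction. The numbers below are placeholders, not real estimates:

```python
# Sketch: comparing two species-delimitation models for ONE clade via
# Bayes factors, reported as 2 * ln(BF) as in the BFD* tutorial.
# The marginal-likelihood values below are placeholders, not real results.

def two_ln_bf(mle_model_a: float, mle_model_b: float) -> float:
    """2*ln Bayes factor of model A over model B, computed from marginal
    log-likelihoods estimated by path sampling on the SAME sequences."""
    return 2.0 * (mle_model_a - mle_model_b)

# Hypothetical path-sampling estimates for one clade:
mle_two_species = -4480.0  # placeholder value
mle_one_species = -4500.0  # placeholder value

bf = two_ln_bf(mle_two_species, mle_one_species)
print(f"2ln(BF) = {bf:.1f}")  # positive values favour the two-species model
```

Only runs that share the same set of sequences go into the same subtraction; estimates from different clades are never compared directly.)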

Cheers,

Remco