Nested Sampling, StarBeast, and sampling frequency

Kate Naughton

Jul 11, 2019, 2:53:51 PM
to beast-users
Hello all,

I'm currently trying to tie up some species delimitation work, using *BEAST (Beast 2.5.2) with the Nested Sampling package. I'm using 50 exon sequences, and I have three models with variously assigned taxa. I'm exploring the analyses using both my own Windows machine and CIPRES. I don't have any problems getting it to run - the NS tutorial was very helpful in sorting this out.

However, I do have questions about what sort of sampling frequency to specify in the XML file. I'm using 400 particles (information content maxes out at just under 1600), subchain length of 5000, and I've set the chain length itself to 250,000 (in reality, the run stabilises and terminates before it hits 200,000).

In a standard MCMC run, I base my sampling frequency on chain length and how many trees I want (<10k); I believe I've mistakenly used the subchain length for guidance here, and have ended up with (in one case) nearly 160k trees! Needless to say, this resulted in an error (apologies that I didn't retain the error text - "more trees than samples" was the general message) and enormous, unwieldy outfiles (now deleted; I kept the marginal likelihood estimates out of interest).

I set "StoreEvery" and "LogEvery" to 1000 in my xml file, thinking to strike a balance, and it's obviously backfired. Since it looks like NS is a relative newcomer to Beast2, I haven't been able to find any other examples of people using NS with *BEAST.

TL;DR: can anyone explain the mathematical relationship between chain length, number of particles, sub-chain length, and excessive tree numbers in the NS analyses? I'd like to be able to set a sensible value.

Thanks in advance for your help,
-Kate
PS If necessary, I can reliably reproduce the error text - but it will take me a couple of days.

Remco Bouckaert

Jul 14, 2019, 7:34:28 PM
to beast...@googlegroups.com
Hi Kate,

If you are not interested in the posterior sample, you can remove the tree log from the XML (or comment it out) to prevent excessively large tree files from being generated.

The number of trees generated is a function of the information (one of the quantities estimated by nested sampling) and is linear in the number of particles, so more particles means larger tree files. The sub-chain length has no impact on the number of trees logged (assuming it is large enough; with an inappropriately small sub-chain length, the analysis may get stuck prematurely, resulting in a smaller but inaccurate posterior tree sample). The chain length is just an upper bound on the number of steps: if you choose it too small, the analysis may stop prematurely; if it is too large (which is the safer choice), the default stopping criterion is used.
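For intuition only, here is a back-of-the-envelope sketch of that scaling. It assumes the textbook nested-sampling result that termination takes on the order of N × H iterations (N particles, information H in nats) and that one state is logged per iteration; neither assumption is a statement about what the BEAST NS package does internally, and the function name is my own.

```python
# Rough order-of-magnitude sketch, NOT the exact BEAST NS behaviour:
# assume termination after about N * H iterations (N particles,
# information H in nats), with one tree logged per iteration.
def approx_logged_trees(n_particles, information_nats):
    return round(n_particles * information_nats)

# Doubling the particle count doubles the expected tree-log size;
# the sub-chain length never enters the formula.
for n in (100, 200, 400):
    print(n, approx_logged_trees(n, information_nats=1440.0))
```

Whatever the exact constant, the point is that tree-log size grows with the particle count, not with the sub-chain length or the logging interval you'd tune in a standard MCMC run.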

Cheers,

Remco

--
You received this message because you are subscribed to the Google Groups "beast-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to beast-users...@googlegroups.com.
To post to this group, send email to beast...@googlegroups.com.
Visit this group at https://groups.google.com/group/beast-users.
To view this discussion on the web visit https://groups.google.com/d/msgid/beast-users/d6180c11-1547-445d-a4f9-7fe54e5b3e9e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Kate Naughton

Jul 18, 2019, 2:22:00 PM
to beast-users
Hi Remco,

Thank you sincerely for the swift reply - ditching the tree logger has been a huge relief, as in the present case I'm not interested in the posterior sample or anything other than the marginal likelihood and standard deviation estimates. The explanation of how the various parameters interact is also appreciated (I felt the tutorial got most of it across but missed some key points), particularly the irrelevance of sampling frequency. My chains now get set to ludicrously high values, and I've finally had a 400-particle analysis resolve. (Previously I said the runs terminated under 200k steps; that was an error and referred to the 64-particle run, which gave a standard deviation too high to distinguish my models.)

I have hit a GC overhead limit, but that's (a) not surprising, (b) fixable and (c) irrelevant to my final numbers (this time).

I do have a question regarding the output of a nested sampling run: although the FAQ states that the three estimates of the marginal likelihood provided at the end of the run should not be different enough to worry about (as long as the appropriate SD is reported), I'm still puzzled by the difference in how each is generated - one has the SD marked "subsample", another "bootstrap", and the third is presented simply with brackets around the standard deviation value.

Assuming it can be explained in broad terms, how are these three SDs produced, as distinct from the further SD value provided below them? Also, which of the marginal likelihood estimates is that fourth SD (printed under the information content) supposed to match? Or is it supposed to accompany the final value printed to screen before the conclusion of the run?

e.g. (from an earlier run)
Marginal likelihood: -23154.59891321004 (bootstrap SD=5.099546305106115)
Marginal likelihood: -23154.426732159405 (subsample SD=4.916936211187053)
Marginal likelihood: -23154.964067192115 (4.375577431226804)
Information: 1440.6436677045785
SD: 4.744476505146173

On the one hand, these values are all clearly within spitting distance of one another, so it will make no difference to my model selection (the desired outcome) - on the other hand, I can see how I may need to describe the values in a paper, or explain to a reviewer why I selected a particular value. It might not matter from an immediate practical standpoint, but it might end up mattering from a publication standpoint (which is another kind of practical standpoint).

Thanks again for the help,
Cheers,
-Kate



Remco Bouckaert

Jul 21, 2019, 10:27:25 PM
to beast...@googlegroups.com
Hi Kate,

The difference between these estimates is rather technical in nature — I hope to explain it without getting lost in too much detail.

Nested sampling produces a sequence of save points with increasing likelihoods L1..Lk. Each likelihood Li is associated with a value Xi: the probability that a state sampled from the prior has likelihood at least Li. The likelihoods L1..Lk are known exactly, but the Xi values are too complex to calculate. However, we know that they form a decreasing sequence starting at 1, and that each Xi is the previous value X(i-1) multiplied by a factor distributed according to a Beta(N,1) distribution, where N is the number of points used for the nested sampling analysis.
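As a quick numerical sanity check of that shrinkage distribution (a standalone sketch, not BEAST code; variable names are my own): for t ~ Beta(N, 1), the expected log-shrinkage per step is -1/N, which is what makes the Xi sequence contract geometrically.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 400  # number of nested-sampling points (particles)

# Each iteration multiplies the remaining prior volume X by a factor
# t ~ Beta(N, 1); for that distribution, E[log t] = -1/N.
t = rng.beta(N, 1, size=200_000)
mean_log_shrink = np.mean(np.log(t))
print(mean_log_shrink, -1.0 / N)  # the two values agree closely
```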

There are different ways to treat this random variable. The simplest is to take the mean value estimate, Xi = exp(-i/N), which is what the last estimate in the output uses.

An alternative is to randomly sample each Xi as a fraction of the previous X(i-1), multiplying by a value drawn from the Beta(N,1) distribution. This sampling from the beta distribution can be repeated many times, and the mean and variance of the marginal likelihood estimate computed from the resulting sample. Resampling the likelihoods with replacement at the same time gives the estimate marked "subsample". There is yet another way to obtain likelihoods, by unravelling the sequence (as described in Algorithm 2 of https://arxiv.org/pdf/1703.09701.pdf), which produces the estimate marked "bootstrap".
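To make the difference concrete, here is a toy comparison (simulated likelihoods and my own function names, not BEAST output or BEAST code) of the deterministic Xi = exp(-i/N) estimate against the repeated-Beta-sampling estimate and its SD:

```python
import numpy as np

rng = np.random.default_rng(1)

N = 100    # number of nested-sampling points (particles)
K = 2000   # number of save points / iterations

# Toy stand-in for the increasing likelihood sequence L1..Lk of a real run
# (in practice these values come from the BEAST NS log file).
i = np.arange(1, K + 1)
logL = -100.0 + 20.0 * (1.0 - np.exp(-3.0 * i / K))

def log_evidence(logX, logL):
    """log Z for Z = sum_i L_i * (X_(i-1) - X_i), computed stably."""
    X = np.exp(logX)
    widths = -np.diff(np.concatenate(([1.0], X)))  # X_(i-1) - X_i > 0
    terms = logL + np.log(widths)
    m = terms.max()
    return m + np.log(np.exp(terms - m).sum())

# Deterministic estimate: X_i = exp(-i/N), as in the last value printed.
logZ_det = log_evidence(-i / N, logL)

# Randomised estimate: X_i = t_i * X_(i-1) with t_i ~ Beta(N, 1), repeated
# many times; the spread of the replicates is the reported SD.
reps = [log_evidence(np.cumsum(np.log(rng.beta(N, 1, size=K))), logL)
        for _ in range(200)]
print(logZ_det, np.mean(reps), np.std(reps))
```

The "subsample" and "bootstrap" variants additionally resample or unravel the likelihood sequence itself; when everything is behaving, all of these land close together, as in the output you pasted.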

Regardless, all these methods give approximately the same estimate: if things are fine (as they are in your case), they should not differ substantially.

Cheers,

Remco


Kate Naughton

Jul 22, 2019, 2:19:53 AM
to beast...@googlegroups.com
Hi Remco,

Thanks for that! This is making decent sense to a non-mathematician (or rather, a molecular ecologist who has ended up revisiting math in a sideways fashion due to bioinformatic requirements) and so is much appreciated.

I currently have three sets of nested sampling runs going (work machine, home machine, CIPRES) and from the look of it I'm going to have another question when they're done, but I'll start a new thread for it.

cheers
-Kate
