nested sampling on Beast 2

jasper...@gmail.com

unread,

Nov 29, 2018, 11:12:33 PM11/29/18

to beast-users

Hello!

My questions might be quite naive since I'm new to using nested sampling.

1. Can I use nested sampling to select the optimal priors (i.e. Yule model, constant coalescent) for my data apart from nested sampling being used for model selection i.e. strict or relaxed clock?

2. I'm really not sure how long the sub-chain length should be (MCMC).

I read this from a paper by Maturana et al 2018:

"On the other hand, NS only requires the number of active points and the number of MCMC steps used to generate the replacement points. The latter should be chosen in relation to the number of parameters. This number should exceed the dimension of the parameter space in order to guarantee the generation of an independent point, otherwise the marginal likelihood estimate will be biased lower. In our experience, it should be at least 10 times more than the number of parameters, but it is recommended to try different values."

Im not really sure if I understand this very well. Which parameters are being referred in this paragraph? Does it refer to the number of parameters in the setting when performing nested sampling i.e. 4 so then it should be 4x 10= 40 as suggested?

I read from a blog on Nested sampling from Beast website:

"The other tuning parameter is the sub-chain length. This plays the most important role in the reliability of the method. It defines the number of MCMC steps to generate a new point from the prior with the likelihood restriction at each iteration. NS works in theory if and only if the points generated at each iteration are independent. Therefore, this number should be large enough to meet this condition. Otherwise, the method produces underestimates. The implementation of NS in BEAST2 has the option to auto-calibrate this tuning parameter."

Maybe, this is the answer to my previous question. I have not figured out though how to auto-calibrate this tuning parameter so it would give me an idea how long the MCMC run should be. Is there a tutorial to do that?

3. This paragraph from the Nested Sampling blog is quite ambiguous for me:
"If the two models with the highest marginal likelihoods are reasonably far away in terms of their standard deviations, we choose the one with the highest value.Here one can be conservative and prefer a model if and only if the winning model is very far away in terms of SDs. On the other other hand, if the two highest marginal likelihoods are close to each other in terms of their standard deviations, the estimate procedure can be performed again, but with a larger number of active points to increase the precision of the estimates and then decide."

Let me know if I understand it correctly. So there were two competing models being compared in this paragraph and the marginal likelihood for each of this model was computed with their corresponding standard deviations. Given that each of them has a high standard deviation (e.g. more than 1 like 30), the winning model should be the one with the highest marginal likelihood (value with the lowest negative value). But if the standard deviation of each the model is quite small (like each of them has sd=1), then I should repeat the Nested sampling analysis for each of the two models by increasing their number of active points. Is my understanding correct?

4. Finally, below, which of the likelihood should I check. There was a similar question from the previous post. Im just double checking if I am correct that any of these marginal likelihoods can be used for model selection given that I have enough number of subchain length and active point.

"Total calculation time: 1127.811 seconds

End likelihood: 224.80084680134817

Producing posterior samples

Marginal likelihood: -3237.6076331280424 sqrt(H/N)=(18.032373246914972)=?=SD=(18.120594329894292) Information: 325.16648491605474

Max ESS: 25.758872875344174

Processing 653 trees from file.

Log file written to /Users/Jasper/Documents/Tetrastigma/Phylogenetic_analysis/NS/ITSonly_83samples_RelaxLogYule.posterior.trees

Done!

Marginal likelihood: -3238.016583866134 sqrt(H/N)=(18.05014040335818)=?=SD=(17.726293702607325) Information: 325.80756858094344

Max ESS: 25.904104821434654

Log file written to /Users/Jasper/Documents/Tetrastigma/Phylogenetic_analysis/NS/ITSonly_83samples_RelaxLogYule.posterior.log

Done!"

Thanks for bearing with me and I hope you find time to help me. I have attached my xml file just in case you need to try it.

Cheers,

Jasper Obico

ITS_83samples_RelaxLogYule_NS.xml

Remco Bouckaert

unread,

Dec 2, 2018, 9:05:38 PM12/2/18

to beast...@googlegroups.com

Hi Jasper,

On 30/11/2018, at 5:12 PM, jasper...@gmail.com wrote:
1. Can I use nested sampling to select the optimal priors (i.e. Yule model, constant coalescent) for my data apart from nested sampling being used for model selection i.e. strict or relaxed clock?

You can consider the priors to be part of the model, just like the clock models, so you can use Bayesian model selection, e.g., based on nested sampling, to select the most appropriate model.

2. I'm really not sure how long the sub-chain length should be (MCMC).

I read this from a paper by Maturana et al 2018:
"On the other hand, NS only requires the number of active points and the number of MCMC steps used to generate the replacement points. The latter should be chosen in relation to the number of parameters. This number should exceed the dimension of the parameter space in order to guarantee the generation of an independent point, otherwise the marginal likelihood estimate will be biased lower. In our experience, it should be at least 10 times more than the number of parameters, but it is recommended to try different values."

Im not really sure if I understand this very well. Which parameters are being referred in this paragraph?

It was meant to refer to all the parameters in the model, including node heights in the tree, topology of the tree, rates for clock models, etc.. Note that an ultra metric tree with n taxa requires (n-1) heights for internal nodes and 2n-2 parents, so 3n-3 parameters (when tip-sampling, add another n).

Does it refer to the number of parameters in the setting when performing nested sampling i.e. 4 so then it should be 4x 10= 40 as suggested?

Yes, if you only have 4 parameters, but the tree typically has many more parameters and the factor 10 is only a lower bound. Different operator weights make that difference parameters may get sampled with difference frequency, which may mean that you need to increase the factor 10 to something higher.

Running with at least 10x the number of parameters, then running again with 20x to see whether estimates remain consistent is the most robust way to make sure that the chain length is large enough.

I read from a blog on Nested sampling from Beast website:
"The other tuning parameter is the sub-chain length. This plays the most important role in the reliability of the method. It defines the number of MCMC steps to generate a new point from the prior with the likelihood restriction at each iteration. NS works in theory if and only if the points generated at each iteration are independent. Therefore, this number should be large enough to meet this condition. Otherwise, the method produces underestimates. The implementation of NS in BEAST2 has the option to auto-calibrate this tuning parameter."

Maybe, this is the answer to my previous question. I have not figured out though how to auto-calibrate this tuning parameter so it would give me an idea how long the MCMC run should be. Is there a tutorial to do that?

Hmm, after further testing that option it turned out to result in under-estimates of the SD, so it is better to not use this option.

3. This paragraph from the Nested Sampling blog is quite ambiguous for me:
"If the two models with the highest marginal likelihoods are reasonably far away in terms of their standard deviations, we choose the one with the highest value.Here one can be conservative and prefer a model if and only if the winning model is very far away in terms of SDs. On the other other hand, if the two highest marginal likelihoods are close to each other in terms of their standard deviations, the estimate procedure can be performed again, but with a larger number of active points to increase the precision of the estimates and then decide."

Let me know if I understand it correctly. So there were two competing models being compared in this paragraph and the marginal likelihood for each of this model was computed with their corresponding standard deviations. Given that each of them has a high standard deviation (e.g. more than 1 like 30), the winning model should be the one with the highest marginal likelihood (value with the lowest negative value). But if the standard deviation of each the model is quite small (like each of them has sd=1), then I should repeat the Nested sampling analysis for each of the two models by increasing their number of active points. Is my understanding correct?

If the difference of marginal likelihoods is small, it may be that due to a large standard deviation in their estimates it cannot be claimed that they differ from each other. To be sure, you need estimates with lower SDs, which you get when using more particles.

4. Finally, below, which of the likelihood should I check. There was a similar question from the previous post. Im just double checking if I am correct that any of these marginal likelihoods can be used for model selection given that I have enough number of subchain length and active point.

"Total calculation time: 1127.811 seconds
End likelihood: 224.80084680134817
Producing posterior samples

Marginal likelihood: -3237.6076331280424 sqrt(H/N)=(18.032373246914972)=?=SD=(18.120594329894292) Information: 325.16648491605474
Max ESS: 25.758872875344174

Processing 653 trees from file.
Log file written to /Users/Jasper/Documents/Tetrastigma/Phylogenetic_analysis/NS/ITSonly_83samples_RelaxLogYule.posterior.trees
Done!

Marginal likelihood: -3238.016583866134 sqrt(H/N)=(18.05014040335818)=?=SD=(17.726293702607325) Information: 325.80756858094344
Max ESS: 25.904104821434654

Log file written to /Users/Jasper/Documents/Tetrastigma/Phylogenetic_analysis/NS/ITSonly_83samples_RelaxLogYule.posterior.log
Done!"

Thanks for bearing with me and I hope you find time to help me. I have attached my xml file just in case you need to try it.

There is now a FAQ for nested sampling related questions here: https://github.com/BEAST2-Dev/nested-sampling/wiki/FAQ which hopefully clarifies things somewhat.

From the new FAQ: The difference between the estimates is the way they are estimated from the nested sampling run. Since these are estimates that require random sampling, they differ from one estimate to another. When the standard deviation is small, the estimates will be very close, but when the standard deviations is quite large, the ML estimates can substantially differ. Regardless, any of the reported estimates are valid estimates, but make sure to report them with their standard deviation.

Hope this leaves you a little less confused,

Remco

Jasper Obico

unread,

Dec 3, 2018, 3:48:30 PM12/3/18

to beast...@googlegroups.com

Hi Remco,

Thanks for patiently answer all my questions. I'll try my best to digest all these.

Cheers,

Jasper

--
You received this message because you are subscribed to the Google Groups "beast-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to beast-users...@googlegroups.com.
To post to this group, send email to beast...@googlegroups.com.
Visit this group at https://groups.google.com/group/beast-users.
For more options, visit https://groups.google.com/d/optout.

Jasper Obico

unread,

Dec 3, 2018, 10:20:24 PM12/3/18

to beast...@googlegroups.com

Hello!

I have been using nested sampling on BEAST 2. It ran successfully with my xml file using this setting:

However, when I ran the same file with the same setting except for a change on the subchainlength to 20000, it gave me an error message. Please attached.

Also, I tried running the same file on CIPRES and it did not work. Seems like the BEAST 2 version on CIPRES has not been updated with nested sampling. Im not sure if this is true. Let me know what you think.

Cheers,

Jasper

CIPRES Screen Shot 2018-12-04 at 4.17.55 PM.png

BEAST Screen Shot 2018-12-04 at 4.11.23 PM.png

Remco Bouckaert

unread,

Dec 4, 2018, 5:46:39 PM12/4/18

to beast...@googlegroups.com

Hi Jasper,

The error message from editing the XML suggests perhaps your editor accidentally inserted some characters that does not show up in the editor itself, but is part of the file and confuses the XML parser. What did you use to edit the file? Notepad (Windows), TextWrangler (OS X), vi and emacs don’t have this problem.

As you suspected, the error message on CIPRES suggests that the NS package is not installed.

Cheers,

Remco

<CIPRES Screen Shot 2018-12-04 at 4.17.55 PM.png><BEAST Screen Shot 2018-12-04 at 4.11.23 PM.png>

Jasper Obico

unread,

Dec 4, 2018, 10:04:30 PM12/4/18

to beast...@googlegroups.com

Hi Remco,

Thanks for the reply. I was using text editor app on my mac to open and edit the script. I finally figured out the by pasting the original script from the tutorial onto Microsoft word, no hidden characters came up when I edited the script.

So my hunch was correct then. I already sent a message to CIPRES about this. Hopefully, they update the BEAST 2 with Nested sampling function so I can have a platform to run my analysis faster in a cluster. This leads me to another question.

1. Do you have any suggestions how to make multiple runs to try out different number of particles and subchain length? Is this paragraph from the FAQ the answer to my question and how do I do this?

"The parallel implementation makes it possible to run many particles in parallel, giving a many-particle estimate in the same time as a single particle estimate (PS/SS can be parallelised by steps as well)"

Thanks again!

Cheers,

Jasper

Reply all

Reply to author

Forward