Hi all,
After a good deal of testing and finagling, I'm finally using Nested Sampling to compare three models of species delimitation under StarBeast. It has taken a while to fine-tune the sub-chain lengths, and I had a couple of questions.
For interest's sake: 24 individuals, 50 exon sequences, a linked site model (JC69), and a strict clock. Two of the models contain three taxa, the third contains only two. Sub-chain length is 300k, information content ranges from 1400-1600, and I'm running with 396 particles (18 threads, 22 particles per thread, on a c5.9xlarge on Amazon Web Services).
(1) Auto-calibration:
There appears to be an auto-calibration option for the sub-chain length, although it can't be used in a multi-threaded analysis. So unless you're running a fairly small number of particles, I assume its main usefulness (in large analyses) is to run it once, with a single particle, to see where the sub-chain length tops out. I've done this on my end, and ended up with a sub-chain length of 300,000, which is a bit of a slog.
Is there a reason this feature hasn't been mentioned in the tutorials or the "how-to-use" on GitHub? I've run a few tests with single particles and the results are consistent, so it seems fairly well validated; had I been confident in it, it would have saved some time. A few tests suggested 200k would be sufficient (i.e., running with increasing sub-chain lengths until consistent estimates are produced), and if one of those runs hadn't popped up with an odd result that was very different from the previous ones, I might have relied on that, with no reason to doubt the results.
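In case it's useful to anyone doing the same thing, the sweep I describe above can be sketched as a small loop. This is just an illustration in Python: `estimate_logZ` is a hypothetical stand-in for a single-particle NS run at a given sub-chain length, simulated here with noise that shrinks as the sub-chain grows, not a real call into BEAST2.

```python
import random

def calibrate_subchain(estimate_logZ, lengths, tol=1.0, reps=3):
    """Return the smallest sub-chain length whose repeated log-evidence
    estimates agree to within `tol` nats, or None if none qualify."""
    for length in lengths:
        estimates = [estimate_logZ(length) for _ in range(reps)]
        if max(estimates) - min(estimates) <= tol:
            return length
    return None

# Toy stand-in for a single-particle NS run: estimator noise
# shrinks as the sub-chain length grows (illustrative only).
random.seed(42)
def toy_estimate(length):
    return -4000.0 + random.gauss(0.0, 30_000.0 / length)

chosen = calibrate_subchain(toy_estimate, [50_000, 100_000, 200_000, 300_000])
```

The catch, of course, is the one I ran into: a single "consistent-looking" pair of short runs can still be misleading, which is why I ended up repeating the estimates several times per length.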
(2) Stop-factor:
Another parameter we ended up playing with was the stopFactor. With my data, the default value (2) meant that an analysis would stabilise, unable to improve the likelihood, and then sit on those values for about a third of the run time. Repeated tests showed that the runs stabilised at approximately 1.10 * H * N, so dropping the stopFactor to 1.25 has saved a decent number of CPU hours (with a sub-chain length of 300k, these runs take over a week across 18 CPUs), while still leaving a lengthy stabilisation period, which is reassuring.
Is there any reason we shouldn't do this? I'm assuming that a safe stopFactor will vary with the depth and complexity of the data, and the shape of the likelihood space.
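For what it's worth, here's the back-of-the-envelope arithmetic behind my CPU-hour claim. It assumes (as the observed behaviour suggests, though I haven't checked the source) that the run length scales roughly as stopFactor * H * N iterations; the H value below is just the midpoint of my observed range.

```python
def ns_iterations(stop_factor, H, N):
    """Rough run length, assuming the analysis terminates after about
    stop_factor * H * N iterations (H = information in nats,
    N = particle count). This scaling is an assumption on my part."""
    return stop_factor * H * N

H, N = 1500, 396  # mid-range information content, 396 particles

default_run = ns_iterations(2.0, H, N)
reduced_run = ns_iterations(1.25, H, N)
stabilised = ns_iterations(1.10, H, N)  # where my runs actually flatten out

fraction_saved = (default_run - reduced_run) / default_run  # 0.375
safety_margin = reduced_run / stabilised  # ~1.14
```

So on these numbers the reduced stopFactor trims about 37% of the iterations while still running ~14% past the observed stabilisation point, which is where my question about how safe that margin is comes from.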
(3) resume capability:
Apologies if this has been raised already; I see it's on the list of features to add, and I just wanted to ask whether there's been any progress. I sadly don't have a UPS for my desktop machine (hence the AWS time), which means a power outage of less than ten seconds cost me ten days of processing. I realise that adding this functionality is probably much more complex than it looks, but it would be extremely reassuring to have in place!
(4) citations:
I'm starting to put together the paper, and I had a quick look for other papers that have used Nested Sampling under BEAST2. I couldn't find any on Web of Science (using either keywords or a citation search), which isn't surprising given how new the package is, but it's possible I've missed something. Do you know of any so far?
Thanks again for the help. In spite of some of the difficulties listed above, I'm very excited about this analysis and the options it opens up for future research questions.
cheers,
-Kate