StarBeast3 posterior ESS


Else

Jul 20, 2022, 3:22:51 PM
to beast-users
Hello,

I am building a time-calibrated phylogeny with StarBeast3, and I am running into some difficulty getting ESS scores above 200, so I am hoping for any advice on how to speed up the process.

I have 8 taxa with 2-6 haplotypes per taxon, and 100 loci of 5000 bp each (actually I have thousands of loci, but it was running too slow with more than 100 loci). I have run for about 35,000,000 generations with several replicates, and it seems to have difficulty increasing the ESS for the posterior; for example, one replicate has ESS=30 for posterior, ESS=33 for species.coalescent, and ESS=34 for Tree.t:Species.treeLength. The posterior is highly correlated with species.coalescent, and negatively correlated with Tree.t:Species.treeLength. Are these correlations expected, or do they suggest that there is something off with my model that I should change? Is there any way I can help the posterior increase faster?

I am running it with CoupledMCMC with two hot and one cold chain, and with these parameters set up in Beauti:
* Each of the 100 genes has an HKY site model with 4 gamma categories
* uncheck "automatic set clock rate"
* Clock Model: Species tree strict clock, starting value Clock.rate=0.001, check "estimate" for clock rate
* Tree.t:Species: Calibrated Yule Model
* prior for clock model: SpeciesTreeStrictClockRate=log normal, M=0.002, S=2.0, check "Mean in real space"
* prior for age of root: Log normal distribution. Mean = 36.6, S = 0.001, "mean in real space"
* MCMC: store every 5000 for logs
* set scaleFactor for operator strictClockRateScaler to 0.015 (recommended by a previous run)
* set scaleFactor for operator TreeRootScaler.t:Species to 0.002 (recommended by a previous run)
* set scaleFactor for operator BactrianNodeOperator.t:Species to 0.067 (recommended by a previous run)

Are there any other settings that I could change to allow it to reach stationarity? I have run it for many days on 120 cores, so I am worried that it will never reach ESS>200, since the posterior trace still looks very jagged.

I also have very low ESS scores for the TreeDistanceUPGMA.t and TreeDistanceNJ.t parameters for each gene, and ESS<100 for CalibratedYuleModel.t:Species and cySpeciationRate.t:Species. Would this suggest a problem with the analysis, or could the results be publishable with those parameters still below ESS=200? The purpose of the analysis is to date the divergence of one particular node.

Thank you,
Else

Jordan Douglas

Jul 21, 2022, 7:49:10 PM
to beast-users
Hi Else,

The problems you are seeing are unfortunately quite standard for a dataset this size. The model is very complex, and in my experience a dataset this size often requires 100,000,000 to 200,000,000 generations to converge.

The correlations you saw are nothing to worry about:
- The positive correlation between posterior and species.coalescent is normal: species.coalescent is a prior term, and (in log space) the posterior is the sum of the likelihood and the priors, so it moves with the prior.
- The negative correlation between posterior and treeLength is also expected - the tree prior has a higher prior density when the branches are shorter, hence the posterior often goes up when the branches are shorter.
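If you want to sanity-check these correlations yourself, the traces can be read straight out of the .log file you load into Tracer. A minimal sketch (this assumes a standard tab-separated BEAST log with '#' comment lines and a header row; the file name is hypothetical, and the column names are the ones from your post):

```python
# Minimal sketch: check trace correlations directly from a BEAST .log file.
# Assumes a tab-separated log with '#' comment lines and a header row.
import csv
import math

def read_trace(path, column):
    """Return one trace column from a BEAST log as a list of floats."""
    with open(path) as f:
        rows = [line for line in f if not line.startswith("#")]
    reader = csv.DictReader(rows, delimiter="\t")
    return [float(r[column]) for r in reader]

def pearson(x, y):
    """Pearson correlation between two equal-length traces."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# e.g. pearson(read_trace("run1.log", "posterior"),
#              read_trace("run1.log", "species.coalescent"))
```

A value near +1 between posterior and species.coalescent, and a negative value against Tree.t:Species.treeLength, would match exactly what you described.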


Since you want to estimate a node divergence time, the most important parameters to converge are probably the species tree length, posterior, likelihood, and the height of the node you are dating.


If you have been running for several days and have only achieved 35 million states, this suggests that a lot can be gained by increasing the sampling speed (i.e. states per hour, which will in turn increase ESS per hour).
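For intuition: the ESS is roughly the number of states divided by the autocorrelation time of the trace, which is why a jagged, slowly-mixing posterior needs so many more states. A rough sketch of the calculation (simplified from what Tracer actually does - Tracer's estimator caps the maximum lag and handles a few more details):

```python
# Rough sketch of an autocorrelation-based ESS, as used by Tracer-style
# diagnostics: ESS = N / tau, where tau = 1 + 2 * sum of lag
# autocorrelations, truncated once the autocorrelation drops <= 0.
def ess(trace):
    n = len(trace)
    mean = sum(trace) / n
    dev = [x - mean for x in trace]
    var = sum(d * d for d in dev) / n
    if var == 0:
        return float(n)  # a constant trace is treated as n samples
    tau = 1.0
    for lag in range(1, n):
        rho = sum(dev[i] * dev[i + lag] for i in range(n - lag)) / (n * var)
        if rho <= 0:  # truncate once the correlation dies out
            break
        tau += 2.0 * rho
    return n / tau
```

A strongly autocorrelated trace (e.g. one long drift up then down) will give an ESS of just a handful, no matter how many states it contains - which matches an ESS of 30 after 35 million generations.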
Here are some things you can do to make things work a little faster:

1. I suggest you use 1 gamma category. The tree likelihood calculation is very expensive, and with N categories the chain runs ~N times slower, so dropping from 4 categories to 1 should make BEAST run about 4x as fast, even though the model will likely fit slightly worse. If you estimate the proportion of invariant sites instead of gamma rate heterogeneity, this can capture some of the same signal.

2. 5000 bp per locus is a lot. Depending on how conserved these sequences are, this could be costing you a lot of time. See how many site patterns BEAST reports in the terminal. If the number of patterns is <<5000, then it's probably okay to use all the sites, but if the number of patterns is close to 5000, then almost every column in the alignment is unique and requires its own likelihood calculation. If the number of site patterns is quite large, you might want to downsample the sites to improve the runtime.
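You can also count the patterns yourself without launching BEAST. A minimal sketch (plain FASTA input is assumed; gaps and ambiguity codes are treated as ordinary characters here, so the count may differ slightly from what BEAST reports):

```python
# Sketch: count unique site patterns (alignment columns) in a locus.
# BEAST computes the tree likelihood once per unique pattern, weighted
# by how often that pattern occurs, so fewer patterns = faster chain.
def read_fasta(path):
    """Read a FASTA file into a list of sequence strings."""
    seqs = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith(">"):
                seqs.append("")
            elif line:
                seqs[-1] += line.upper()
    return seqs

def count_site_patterns(alignment):
    """Number of unique columns in a list of equal-length sequences."""
    return len(set(zip(*alignment)))

# e.g. count_site_patterns(read_fasta("locus1.fasta"))  # file name hypothetical
```

If this comes back close to 5000 for a 5000 bp locus, that locus is carrying mostly unique columns and is a good candidate for downsampling or realignment.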

3. Following from (2), I suggest you manually inspect the alignments and look for poorly aligned regions with lots of gaps. These regions contain noise and inflate the number of site patterns (and therefore make beast run slower). Although it is quite tedious especially with 100 loci, you can gain a lot in terms of both runtime as well as signal by manually correcting poorly aligned regions, or removing them altogether. This can be done using AliView.

4. Combine the chains together. You don't need to aim for ESS>200 in each individual chain - the combined chain is fine. Just make sure they have converged to the same distribution, and then combine them using the logcombiner tool, and inspect in Tracer.
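For what it's worth, the combining step itself is conceptually simple: drop the burn-in from each chain and concatenate the rest. A sketch of the idea (for real analyses use the logcombiner tool shipped with BEAST, which also handles log headers and optional resampling; something along the lines of `logcombiner -log run1.log -log run2.log -b 10 -o combined.log` - check `logcombiner -help` for the exact flags):

```python
# Sketch of what log combining does: discard a burn-in fraction from
# each chain's samples, then concatenate what remains into one trace.
def combine_chains(chains, burnin_frac=0.1):
    """chains: list of per-chain sample lists; returns the pooled trace."""
    combined = []
    for samples in chains:
        start = int(len(samples) * burnin_frac)  # samples to discard
        combined.extend(samples[start:])
    return combined
```

The pooled trace is what you then inspect in Tracer - ESS adds up across chains as long as each has genuinely converged to the same distribution.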



Hope this helps,
Jordan

Else

Jul 25, 2022, 2:24:42 PM
to beast-users
Thank you for the advice! I switched from 4 gamma categories to 1 gamma category with estimating the proportion of invariant sites, and it is running about three times faster now. I am combining 16 runs together, and after two days the combined posterior reached ESS=100, so I am crossing my fingers that it will eventually reach stationarity.

Thanks for the help,
Else
