StarBeast 2 Clock Rates

173 views
Skip to first unread message

Dylan O'Hearn

unread,
Oct 17, 2022, 6:56:40 PM10/17/22
to beast-users
Hello,

I'm running a species tree analysis using Starbeast 2- I would use Starbeast 3 but bmodeltest isn't yet available, and anyway I only have 4 loci.

However, I'm not sure how to go about setting clock rates.  To be more specific, I'm using the SpeciesTreeUCLN template, and I have 7 partitions, 4 of which share a tree.

Should I be checking "estimate" on the clock models?  Without getting into too much detail, I'm confused by the differences between the template's default settings, the recommendations in the starbeast2 tutorials, and what is appropriate for original starbeast.

Thanks for your help!

Jordan Douglas

unread,
Oct 19, 2022, 10:02:27 PM10/19/22
to beast-users

Hi Dylan,

I just added bModelTest to starbeast3 today. You will need to update to the latest versions of starbeast3 and bmodeltest (and be using beast 2.7).

Regarding clock rates, these are the clock rates of the gene trees (1 per gene tree). By default these are all fixed at 1 meaning that all gene trees have the same clock rate. Oftentimes people fix the first clock rate at 1 and estimate all of the rest so that they do not interfere with the tree height estimates and the starbeast2 taming the beast tutorial suggests linking clock rates together. Overall, the clock rate of a gene branch b is:

clockRateGeneTree x clockRateSpeciesBranch(b)

where one branch rate is estimated per species branch under the species tree relaxed clock model, and the rate of branch b is determined by the species branch(es) which contain b.

I hope this is helpful. Configuring the clock model ultimately depends on what you want to achieve so let me know if you have any further details on what you are aiming for

Jordan

Dylan O'Hearn

unread,
Oct 21, 2022, 1:36:03 PM10/21/22
to beast-users
Oh, and I should say my goals are to perform divergence dating, followed by species delimitation using speedemon.  However I'm currently doing a normal run without any calibrations just to see which nodes are highly supported.

Dylan O'Hearn

unread,
Oct 21, 2022, 1:36:03 PM10/21/22
to beast-users
Thanks Jordan, I'm going to go ahead and use Starbeast 3.  One question about that- do you have a recommendation for the number of threads to use?  I have 7 partitions from 4 loci, and 226 taxa making up 13 species.  So I'm assuming either 4 or 7 threads would be appropriate, but my run times are around 90 minutes per million iterations, which seems a lot slower than it should be for such a small dataset.  I know there are many differences but Starbeast 1, for example, takes about 11 minutes per million iterations.  I just want to make sure this isn't some mistake in how I'm running it.

On Wednesday, October 19, 2022 at 9:02:27 PM UTC-5 jdo...@aucklanduni.ac.nz wrote:

Jordan Douglas

unread,
Oct 23, 2022, 5:14:06 AM10/23/22
to beast-users
With 7 gene trees constrained within a species tree, the maximum number of threads is 7. The number of minutes per million states cannot be directly compared with starbeast1, because the parallel operator generates many hundreds or thousands of states every time it is called, but only increases the number of "iterations" by 1 each time. So 90 minutes per million seems okay to me - the most cricual runtime metric is ESS / hr which involves looking at Tracer. Also note that starbeast3 usually starts out slow then suddenly speeds up after the parallelisation optimises.

In terms of clock models, I would suggest giving each partition its own relative clock rate (each with a mean of 1), and then giving the species tree an informed time-calibrated clock rate (ie. subst per unit of time). The clock rate of a gene tree is then the relative rate x the species rate. If you find that the gene tree clock rate estimates do not average 1, this is a non-identifiability issue (because they are correlated with the tree heights), but you can get around this by fixing one of the gene clock rates at 1 in the Gene Clock Model tab.

Dylan Ohearn

unread,
Oct 23, 2022, 3:39:59 PM10/23/22
to beast-users
Hi Jordan,

Thanks a lot for your help!  I just have a couple more questions/clarifications. 

First, I actually have 4 gene trees, with 7 partitions. 4 of the partitions are all mitochondrial and hence share the same tree, and the same clock model.  I have them partitioned by codon position, plus a noncoding partition, to allow for separate site models.  But anyway, it sounds like 4 threads is the correct number.

Second, I'm unsure of how estimating substitution rates fits into what you've described.  I want to account for the differences between the mtdna codon partitions, and estimating substitution rates seems like the way to do that, but will that cause a non-identifiability issue if I then also give the mtdna gene tree its own clock rate?

Third, by default "automatic set clock rate" and "automatic fix mean substitution rate flag" are deselected; do these have any role to play in Starbeast 3?

And fourth, when you say to give the species tree an informed time-calibrated rate, would that involve fixing this rate or estimating it and setting an appropriate prior?

I apologize for having so many questions; I felt I had a good grasp of the substitution and clock rate situation with regular BEAST2 and Starbeast, but adding in the species tree clock is throwing me for a curve.

Dylan

Jordan Douglas

unread,
Oct 24, 2022, 5:14:39 PM10/24/22
to beast-users
Hi Dylan,


First, I actually have 4 gene trees, with 7 partitions. 4 of the partitions are all mitochondrial and hence share the same tree, and the same clock model.  I have them partitioned by codon position, plus a noncoding partition, to allow for separate site models.  But anyway, it sounds like 4 threads is the correct number.

Yes in that case 4 is the maximum. If you request more than 4  threads starbeast3 will round down to 4.

Second, I'm unsure of how estimating substitution rates fits into what you've described.  I want to account for the differences between the mtdna codon partitions, and estimating substitution rates seems like the way to do that, but will that cause a non-identifiability issue if I then also give the mtdna gene tree its own clock rate?

As long as you have strong constraints on either the tree heights or clock rates (or both) which prevent the rates from wandering to infinity and the heights wandering to 0, or vice versa, then you should be okay. If you open tracer and view the joint distribution between a tree's height and its clock rate, you should see the banana shaped correlation between the two. If clock rate or tree height (or length) estimates are spanning multiple orders of magnitude then this indicates a non-identifiability issue. This is sometimes a trial and error experiment, and the only reason I brought it up is because you have just 4 gene trees so there might or might not be enough signal to get around this issue without adding further constraints (eg. fixing a clock rate at 1). Just something to keep an eye on.


Third, by default "automatic set clock rate" and "automatic fix mean substitution rate flag" are deselected; do these have any role to play in Starbeast 3?

These are best left unselected in the case of starbeast3 since it allows individual gene clock rates to be turned on and off at will. Restoring these settings to true can give unwanted behaviour.


And fourth, when you say to give the species tree an informed time-calibrated rate, would that involve fixing this rate or estimating it and setting an appropriate prior?

Best to estimate it with an informed prior which will capture your uncertainty about the parameter. There is a good guide here for this process: https://beast2-dev.github.io/hmc/hmc/Priors/ClockPrior/

If you have nuclear and mitochondrial partitions, you might want to have one clock rate for the nuclear and one for the mitochondria. For example suppose that you believe that the mitochondrial mutation rate is 10x the nuclear rate. Then you could set the species tree rate to A, give all of the nuclear partition clock rates a prior with a mean of 1, and give mitochondrial partition clock rates a prior with a mean of 10. This way the nuclear partitions will on average have a clock rate of A and the mitochondrial 10A. There are a lot of different ways you could parameterise this, depending on your prior knowledge about the system being studied

I apologize for having so many questions; I felt I had a good grasp of the substitution and clock rate situation with regular BEAST2 and Starbeast, but adding in the species tree clock is throwing me for a curve.

No problem happy to help :)


Jordan

Dylan O'Hearn

unread,
Oct 25, 2022, 2:20:16 AM10/25/22
to beast-users
Thanks again Jordan.

Every time I try to run an analysis with more clock or site models than trees (e.g., 4 mtdna partitions with separate site and/or clock models but a single tree), it runs until the chains optimize and then I get an error soon after: "Eigenvalues not converged for Q-matrix". 

In the end I just switched to a single mtdna partition, which seems to run fine.  Is Starbeast3 just not intended to account for codon position etc.?  Or is it maybe something about my data?  I'm just concerned that not including the codon variation might bias my estimates of divergence times.

Jordan Douglas

unread,
Oct 25, 2022, 2:41:48 AM10/25/22
to beast-users
Hi Dylan,


I have seen this issue before and I believe it is related to bModelTest. I don't think it's related to the codons positions (and starbeast3 should be fine with codon positions). Last time this happened to me it was from some of the amino acid frequencies being 0 which gave unwanted numerical issues to the nucleotide transition Markov model. In this case it could possibly be related to the substitution model rates/nucleotide frequencies wandering off into a strange space where the substitution matrix Q cannot be calculated. This could possibly be due to a bad operator generating illegal states (ie a bug), or it could be due to the model/data. 

Do you notice anything unusual about the mcmc chain when it gets to this point (eg. very high / low substitution rates or nucleotide frequencies)? And what substitution models are being inferred by bModelTest? Is it an error or just a warning? If error could you please paste the whole error message?


Thanks,
Jordan

Dylan O'Hearn

unread,
Oct 25, 2022, 5:31:12 PM10/25/22
to beast-users
Hi Jordan,

It doesn't actually say the word error, but it aborts the run.  I've added the message to the bottom of this post.  There are also warnings like this before the run actually begins, but I don't know if those matter: "Warning:Tree Tree.t:mt_trna all has initial value <= 0, which is not appropriate for a LogTransform"

All the values in the log file before the run stops seem plausible to me, but I could email you the log file if you want.  I've also attached a picture of the modelindicator section. subst.jpg




Eigenvalues not converged for Q-matrix
[NaN, NaN, NaN, NaN]
[NaN, NaN, NaN, NaN]
[NaN, NaN, NaN, NaN]
[NaN, NaN, NaN, NaN]
Something went wrong in a calculation of null
java.lang.ArithmeticException
        at beast.base.evolution.substitutionmodel.DefaultEigenSystem.hqr2(Unknown Source)
        at beast.base.evolution.substitutionmodel.DefaultEigenSystem.decomposeMatrix(Unknown Source)
        at beast.base.evolution.substitutionmodel.GeneralSubstitutionModel.getEigenDecomposition(Unknown Source)
        at beast.base.evolution.likelihood.BeagleTreeLikelihood.setUpSubstModel(Unknown Source)
        at beast.base.evolution.likelihood.BeagleTreeLikelihood.calculateLogP(Unknown Source)
        at beast.base.evolution.likelihood.TreeLikelihood.calculateLogP(Unknown Source)
        at beast.base.inference.CompoundDistribution.calculateLogP(Unknown Source)
        at starbeast3.core.ParallelMCMC.propagateState(Unknown Source)
        at starbeast3.core.ParallelMCMC.doLoop(Unknown Source)
        at starbeast3.core.ParallelMCMC.run(Unknown Source)
        at starbeast3.operators.MultiStepOperator$CoreRunnable.run(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)

Jordan Douglas

unread,
Oct 27, 2022, 4:34:25 AM10/27/22
to beast-users
Hi Dylan,

If I recall correctly, this is only an error if you run with beagle. The default beast java likelihood calculation just gives a warning. If you run as 

~/beast/bin/beast -java file.xml

then it should tolerate the issue and just treat the unconverged matrices as if they were illegal states. But the error/warning could potentiialy suggest something sinister. If you email me the xml file I can take a look to see if anything looks wrong

Jordan

Pei-Wei Sun

unread,
Nov 29, 2022, 10:44:05 PM11/29/22
to beast-users
Hello Jordan
I have also encountered this "Eigenvalues not converged for Q-matrix" error after I checked the "linked-tree" in my .xml file.
Although the .xml file with linked-tree run successfully with -java commands as you recommended, the process is much slower than the model with unlinked-tree (other setting are all same; 24 hours/1M sample for linked-tree and 1 hour/1M sample for unlinked-tree). Also, the ESS is low (<100) even with 100 million MCMC, could you provide me advice for promoting ESS and speed in my data?
Thanks for your help! 
Below is my xml file with linked-tree. I have 51 genes, 71 individuals and 8 taxa. I estimated site model with bmodel test and birth-death model  for species tree. I also calibrated root to a specific time in the prior.

Pei-Wei


jdo...@aucklanduni.ac.nz 在 2022年10月27日 星期四下午4:34:25 [UTC+8] 的信中寫道:

Jordan Douglas

unread,
Dec 5, 2022, 7:49:47 PM12/5/22
to beast-users

Hi Pei-Wei,

Linking the trees will make things much slower because the trees can no longer be parallelised as efficiently. 24hr/million states is very slow! It's probably best to run with the trees unlinked in this case, and which also gives a more realistic realisation of the multispecies coalescent process. This will help get the ESS past 200 in a reasonable timeframe.

In terms of the eigenvalue issue, I am not entirely sure what the cause of this is but I'm looking into it. It seems like one or more of the gene tree substitution models are wandering into a strange part of parameter space, on account of the increased flexibility that comes with the bModelTest substitution model. In the meantime, if you can narrow which partition(s) is causing the error then you could change its substitution model to something else.

Jordan

Dylan O'Hearn

unread,
Dec 9, 2022, 12:05:05 PM12/9/22
to beast-users
Hello,

Jordan was able to further assist me via email and I ended up figuring out how to avoid the error, but I wanted to make sure to post it here incase others read this in the future hoping to find the answer.

The problem is that, for my four mitochondrial partitions that shared one tree (noncoding, plus each codon position for the coding regions), I had four separate clock models.  The solution is that the clock models should be linked.  You account for the different rates among partitions by giving each its own site model, and checking "estimate" for all but one of the mutation rates.  So anyway, once I linked the clock models, it stopped crashing.
Reply all
Reply to author
Forward
0 new messages