Several questions - MCMCTree/BASEML


Ziv Lieberman

Apr 22, 2025, 11:06:43 PM
to PAML discussion group
Hi all,
Thanks for the approval into the group! I've got several questions about configuring the control file for divergence dating in MCMCTree using the approximate likelihood method. I've looked through the documentation and tried searching the group but still have some uncertainties.

1. I see that BASEML allows the use of the GTR substitution model via model = 7, and that one can use this in MCMCTree with the approximate likelihood method, because that will call BASEML. How does one specify priors on the stationary frequencies and exchangeabilities? I only found specifications for κ or other parameters for HKY and simpler models.

2. When setting rgene_gamma, the example .ctl file says the gamma should be specified as <shape> <scale>. I am only hesitant because the R snippet showing how to inform these values from the tree height and root age gives a shape and a <rate>, rather than a scale. Should I really use the scale (the inverse of the rate) here? Or can I simply estimate a good rate parameter using the R example and then take 1/that value?
Relatedly, what is a 'sane' range for such a gamma distribution to have? E.g., using my selected scale parameter (or the equivalent rate), the range is <1.0.

3. Is the alpha parameter for gamma-distributed ASRV a fixed value, or just an initial value that will be sampled? Is there a way to specify a prior on alpha?

Finally, a broader question about using the conditional vs the multiplicative construction for the BDSP with node calibrations. The PAML documentation suggests using the multiplicative construction, while my reading of Barba-Montoya et al. 2017 Mol. Phylog. Evol. 114 suggests that the conditional approach may suffer less severely from the effects of truncation. Obviously the thing to do is to try both methods with a few replicates in a data-free run (which I will need to do anyway to assess effective priors). Still, does anyone have any further recommended reading or thoughts on this topic besides what can be found in Heled & Drummond 2012 Syst. Biol. 61? I have read dos Reis et al. 2014 Syst. Biol. 63, but I am using an unpartitioned phylogenomic dataset.

Thank you for any help!
Cheers,
Ziv Lieberman

Sishuo Wang

Apr 23, 2025, 11:37:27 PM
to PAML discussion group
Hi Ziv,

Thanks for your questions! These are very detailed and clear. I'll tentatively answer 1-3 and leave the last one, which is more difficult and which, as far as I know, has seen some recent updates, for others, particularly Sandra.

1. I am unsure whether there is a way to do that for BASEML. Could I ask why you intend to specify those parameters? There could be a way in CODEML, where you specify those values in a new .dat file and import it into CODEML. Typically you can just use GTR, which will automatically estimate all the substitution rate parameters, much as other programs do.

2. I think it's shape and scale, as mentioned on p. 3 of this tutorial. Sandra, could you please confirm which one is correct?

3. You can fix alpha, for which you may also want to have a look at this one. However, I don't think you can specify a prior for alpha directly. If you really want to do it in some way (which, I think, is likely to be unnecessarily complex), you could run MCMCTree multiple times using different fixed alpha values chosen according to P(alpha|D) and sample the posterior times. Note the following:
  • P(alpha|D) ∝ P(alpha)*P(D|alpha)
where
  • P(D|alpha) can be approximated by, say, a gamma distribution whose parameters can be roughly estimated by bootstrapping (resample the alignment sites, e.g., 100 times, and calculate the MLE of alpha for each replicate). For an entire PAML/MCMCTree analysis (with the approximate likelihood method; I am unsure about the exact likelihood), alpha is estimated by BASEML/CODEML and is not updated while MCMCTree samples the posterior times and rates.
  • P(alpha) is your prior.
  • D is your alignment
So then, you could obtain P(alpha|D) by Monte Carlo, which can then be used to calculate P(T|D) via the law of total probability:
  • P(T|D) = Σ_alpha P(T|D, alpha) * P(alpha|D)
Note that P(T|D, alpha) can be calculated by fixing alpha in the control file. The notation above is not rigorous, but it should not be difficult to follow. Apologies for any inconvenience.
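Just to make the recipe above a bit more concrete, below is a minimal R sketch of the idea (untested; all numbers are made up, and it implicitly assumes a flat prior P(alpha) so that the bootstrap distribution of the MLEs stands in for P(alpha|D)):

```
## Step 1: approximate P(alpha|D) from bootstrap MLEs of alpha.
## 'alpha_boot' would hold the MLEs of alpha from, say, 100 bootstrap
## replicates of the alignment (each re-estimated with BASEML/CODEML);
## here it is simulated just so the script runs.
set.seed(1)
alpha_boot <- rgamma(100, shape = 20, rate = 40)   # placeholder for real bootstrap MLEs

## Fit a gamma distribution to the bootstrap MLEs by the method of moments.
m <- mean(alpha_boot); v <- var(alpha_boot)
shape_hat <- m^2 / v
rate_hat  <- m / v

## Step 2: draw a few alpha values from the approximated P(alpha|D).
alpha_draws <- rgamma(5, shape = shape_hat, rate = rate_hat)
round(alpha_draws, 3)

## Step 3 (done outside R): run MCMCTree once per alpha_draws[i], fixing
## alpha to that value in the control file, then pool the posterior samples
## of node ages across runs.  Pooling with equal weights approximates
## P(T|D) = sum_alpha P(T|D, alpha) * P(alpha|D).
```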

best,
sishuo

Ziv Lieberman

Apr 24, 2025, 12:18:26 AM
to PAML discussion group
Hi Sishuo,
Thank you for your very quick and detailed reply! This is extremely helpful.
1. As long as the program will calculate the necessary parameters, I suppose there is no real need to specify priors myself. I only assumed that, because there are ways to put priors on, e.g., kappa in HKY85, one would need to do the same for GTR. So, I will go ahead and just specify model = 7.

2. Thank you for checking. I suspect it is indeed as written, but because I have a hard time visualising or intuitively understanding what 'sensible' values would be, I couldn't resolve my uncertainty simply by plotting the given example under both parameterisations.

3.  Thank you for the in-depth suggestion! I am not highly committed to setting a prior on alpha, but just wanted to understand better which values from the control file would be fixed, sampled, or drawn from priors. However, I will hold on to this suggestion, as it could very well be useful in the future!
Looking forward to any further discussion, and I'm sure I'll have more questions as I start to actualize these runs ;)
Cheers,
Ziv

Sandra AC

Apr 24, 2025, 9:57:22 AM
to PAML discussion group
Hi Ziv and Sishuo,

Thanks for the great discussion! Although all of Ziv's questions except the last one have been answered, please find some more details on each of them below which, hopefully, can be of help to everyone in the PAML community!

1. BASEML and CODEML are ML programs, and thus they are going to estimate the MLEs of all model parameters. When enabling the approximate likelihood calculation with MCMCtree, users can estimate the branch lengths, the gradient, and the Hessian matrix needed by MCMCtree to approximate the likelihood calculation (i.e., these are the vectors and matrix that you see in the so-called `in.BV` file) under any of the evolutionary models implemented in BASEML (nucleotide data) or CODEML (amino acid data). In addition, if you had various partitions in your sequence alignment file, you could also analyse them under different models in BASEML/CODEML so that every partition has its branch lengths, gradient, and Hessian estimated under, e.g., the corresponding best-fitting evolutionary model.

While the PAML Wiki has an extensive section explaining how to enable the approximate likelihood calculation, there are also many tutorials showing how to do this step by step. For example, the book chapter Bayesian Molecular Clock Dating Using Genome-Scale Datasets guides users through the analysis of phylogenomic datasets with MCMCtree. This is the protocol paper you may want to check if you want step-by-step guidelines on how to make the most of the approximate likelihood calculation implemented in MCMCtree when analysing large datasets. All the input and control files required to follow the examples given in the protocol paper can be accessed via the divtime GitHub repository. We have also written a new tutorial for amino acid datasets that you can find in the paml-tutorial GitHub repository -- please note that this tutorial is subject to change as we keep updating it for workshops.
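To give a feel for what MCMCtree does with the vectors and matrix stored in `in.BV`, here is a minimal R sketch of the quadratic (Taylor-series) approximation to the log-likelihood described by dos Reis and Yang (2011). The branch lengths, gradient, and Hessian below are made up for a toy three-branch case, and the sketch ignores the branch-length transforms that MCMCtree can apply:

```
## Approximate likelihood idea: the log-likelihood at branch lengths b is
## replaced by a second-order Taylor expansion around the MLEs b_hat, using
## the gradient g and Hessian H that BASEML/CODEML write to in.BV.
approx_lnL <- function(b, b_hat, g, H, lnL_hat = 0) {
  d <- b - b_hat
  as.numeric(lnL_hat + sum(g * d) + 0.5 * t(d) %*% H %*% d)
}

b_hat <- c(0.10, 0.25, 0.05)                  # MLEs of branch lengths (made up)
g     <- c(0, 0, 0)                           # gradient is ~0 at an interior MLE
H     <- -diag(c(800, 300, 1500))             # negative-definite Hessian (made up)

approx_lnL(c(0.12, 0.24, 0.06), b_hat, g, H)  # log-likelihood difference from the MLE
```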

At the moment, unfortunately, there is no Bayesian approach in any of the PAML programs with which you could incorporate priors on the exchangeability rates.

2. The code snippet on the PAML Wiki uses the `rate` argument of the R function `dgamma`, not the `scale` argument, because it is the `rate` parameter of `dgamma` that equals the `beta` parameter expected in the control file -- I understand that this can be confusing. For those who have not yet visited the PAML Wiki for MCMCtree, Ziv is referring to the code snippet given in the "Overview" section as a suggestion for coming up with sensible gamma priors.
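If it helps, here is a quick check in R that the `rate` argument of `dgamma` is simply 1/`scale`, and that it is this rate that plays the role of `beta` in `rgene_gamma` (the values 2 and 40 below are just an example, not a recommendation):

```
## The 'rate' argument of dgamma is 1/scale, and this rate is the beta used
## by rgene_gamma (the prior mean rate is alpha/beta).
alpha <- 2     # shape
beta  <- 40    # rate; e.g. rgene_gamma = 2 40 gives a prior mean rate of 2/40 = 0.05
x <- seq(0, 0.3, by = 0.001)
all.equal(dgamma(x, shape = alpha, rate = beta),
          dgamma(x, shape = alpha, scale = 1/beta))   # TRUE: same density
alpha / beta                                          # prior mean substitution rate
```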

3. In the control file, the variables `alpha` and `ncatG` specify the gamma distribution that will be used to account for rate heterogeneity across sites. The shape of the gamma distribution is defined with `alpha`, while the number of categories used to discretise that distribution is specified with `ncatG`. If alpha > 0, the program will assume a gamma-rates model, while alpha = 0 means that a single rate for all sites will be used instead. If the gamma-rates model is used, then `ncatG` gives the number of categories with which the gamma distribution is discretised.
During the approximate likelihood calculation, BASEML/CODEML will therefore use the value you set for `alpha` as a starting value to estimate the MLE of alpha -- you will see this at the end of the `tmp*.out` file. For example, here is the relevant output when running BASEML on a small dataset under HKY85+G4; you can see that the MLE for alpha is 0.43687:

```
Parameters (kappa) in the rate matrix (HKY85) (Yang 1994 J Mol Evol 39:105-111):
  2.78171
 
alpha (gamma, K=4) =  0.43687
rate:   0.02237  0.20873  0.76829  3.00061
freq:   0.25000  0.25000  0.25000  0.25000
```
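For those curious where those four rates come from, the short R sketch below reproduces the equal-probability discrete-gamma calculation (Yang 1994), in which each category is represented by its mean rate; with alpha = 0.43687 and K = 4 it should match the `rate:` line above up to rounding:

```
## K equal-probability categories of a Gamma(alpha, alpha) distribution
## (mean 1), each represented by its mean rate.
discrete_gamma <- function(alpha, K = 4) {
  cuts <- c(0, qgamma((1:(K - 1)) / K, shape = alpha, rate = alpha), Inf)
  ## Mean of a Gamma(alpha, alpha) variable within each interval, multiplied
  ## by K to account for the 1/K probability of each category:
  K * diff(pgamma(cuts, shape = alpha + 1, rate = alpha))
}
round(discrete_gamma(0.43687, 4), 5)   # should be close to the 'rate:' values above
```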

4. To give some context to the PAML community subscribed to this discussion group, the latest PAML release (PAML 4.10.8, which we released a few days ago) now implements the multiplicative construction alongside the conditional construction for the birth-death process with species sampling (all details regarding how to enable one approach or the other are given in the PAML Wiki for MCMCtree under the bullet point for the variable `BDparas`).

As specified in the PAML Wiki, both the conditional and the multiplicative constructions are incorrect:
  • The component "f(tC)" appears twice in the multiplicative construction and does not follow the rules of probability calculus, an issue also reported by Heled and Drummond (2012).
  • The conditional prior "f(tC|t1)" of Yang and Rannala (2006) is misspecified because labelled histories are not distinguished from rooted trees. This issue was pointed out by dos Reis (2016).
Regardless of the approach used (multiplicative or conditional), the density for calibrated nodes is affected by truncation. What Barba-Montoya et al. (2017) showed is that the conditional construction leads to marginal priors (i.e., "effective priors") that are closest to the original calibration densities (i.e., "user-specified priors") when truncation is minimal. When truncation is severe, however, the effective priors will differ substantially from the user-specified priors under both the multiplicative and the conditional constructions. On the other hand, while the density for non-calibrated nodes is affected by the aforementioned misspecification in the conditional approach, the multiplicative approach does not suffer from this issue.

Consequently, on balance, we recommend using the multiplicative approach (it generates more reasonable densities for non-calibrated nodes and is now the default option in the control file) and checking that the marginal densities of the "effective priors" (i.e., after the program has applied the truncation) are biologically reasonable. Running the program without data to check whether these marginal densities are reasonable must therefore always be done before analysing the data, regardless of whether the conditional or the multiplicative construction is used; a sketch of one way to carry out this check is given below. If some of those priors look biologically unrealistic, users should adjust the calibration densities in the tree file, re-run the program, and check again. This advice applies not only to MCMCtree but also to any other Bayesian dating program used for timetree inference. All of this information can also be found in the PAML Wiki :)
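As a practical illustration of that last recommendation, here is a minimal R sketch of how the effective prior from a prior-only run (`usedata = 0` in the control file) could be compared against a user-specified calibration density. The file name `mcmc.txt`, the column name `t_n10`, and the gamma calibration parameters are placeholders that you would need to replace with those from your own analysis (and the overlay only applies if the calibration on that node is a gamma density):

```
## Compare the effective prior (sampled node ages from a usedata = 0 run)
## with the user-specified calibration density on one calibrated node.
prior_run <- read.table("mcmc.txt", header = TRUE)   # MCMC samples from the prior-only run
node_age  <- prior_run$t_n10                         # check your own column names!

cal_shape <- 188    # placeholder: shape of the gamma calibration on this node
cal_rate  <- 2690   # placeholder: rate of the gamma calibration on this node

hist(node_age, breaks = 50, freq = FALSE,
     main = "Effective vs user-specified prior", xlab = "node age (time units)")
curve(dgamma(x, shape = cal_shape, rate = cal_rate), add = TRUE, lwd = 2)
## If the histogram (effective prior) departs strongly from the curve
## (user-specified calibration), truncation is reshaping the prior and the
## calibration densities in the tree file may need revisiting.
```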

I hope the answers above somewhat clarify your questions but, if not, please let us know!

All the best,
Sandy

Ziv Lieberman

Apr 24, 2025, 3:13:48 PM
to PAML discussion group
Hi Sandy,
Thank you so much for expanding on these subjects! Of course, now that you have gone into more detail about the approximate likelihood method, it should have been apparent that priors weren't relevant :D But overall, thank you as well for your guidance through the very thorough (but dense) documentation. The clarification on 'scale' vs 'rate' makes sense -- I will use the dgamma 'rate' as beta in the control file.
Your comments on Barba-Montoya et al. are also really useful. Re-reading the paper with your points in mind, I definitely understand the costs and benefits of the different approaches better.
There is much more to ponder here, but this is really invaluable advice, thank you again!
Best,
Ziv