Hi Ziv and Sishuo,
Thanks for the great discussion! While all of Ziv's questions have been answered except for the latter, please find some more details for each of the questions which, hopefully, can be of help to everyone in the PAML community!
1. BASEML and CODEML are ML programs, and thus they are going to estimate the MLEs for all model parameters. When enabling the approximate likelihood calculation with MCMCtree, users can estimate the branch lengths, the gradient, and the Hessian matrix needed by MCMCtree to approximate the likelihood calculation (i.e., these are the vectors and matrix that you see in the so-called `in.BV` file) under any of the evolutionary models implemented in BASEML (nucleotide data) or CODEML (amino acid data). In addition, if you had various partitions in your sequence alignment file, you could also analyse them under different models in BASEML/CODEML so that every partition has the branch lengths, gradient, and Hessian estimated under e.g. the corresponding best-fitting evolutionary models. While the PAML Wiki has an extensive
section explaining how to enable the approximate likelihood calculation, there are also many tutorials showing how to do this step by step. E.g: the book chapter
Bayesian Molecular Clock Dating Using Genome-Scale Datasets guides users through the analysis of phylogenomic datasets with MCMCtree. This is the protocol paper you may want to check if you want to follow step-by-step guidelines to learn how to make the most out of the approximate likelihood calculation implemented in MCMCtree to analyse large datasets. All input and control files required to follow the examples given in the protocol paper can be accessed via the
divtime GitHub repository. We have also written a new tutorial for amino acid datasets that you can check in the
paml-tutorial GitHub repository -- please note that this tutorial is subject to changes as we keep updating it for workshops.
At the moment, there is not a Bayesian approach with which you could incorporate the priors on the exchangeability rates in any of the PAML programs, unfortunately.
2. The R function `dgamma` does not use the `scale` argument for the `beta` parameter, thus why we are specifying the `rate` argument instead -- I understand that this can be confusing, but it is the `rate` parameter in the R function `dgamma` which equals to the `beta` parameter. For those who have not yet visited the
PAML Wiki for MCMCtree, Ziv is referring to the code snippet given in
the "Overview" section as a suggestion to come up with sensible gamma priors.
3. In the control file, variables `alpha` and `ncatG` are two variables used to specify the gamma distribution that will be used to account for rate heterogeneity. The shape of the gamma distribution is defined with variable `alpha`, while the number of categories to discretise such a distribution are specified with variable `ncatG`. If α≠0 (positive greater than 0), the program will assume a gamma-rates model, while α=0 means that the model of one rate for all sites will be used instead. If the gamma-rates model is used, then variable `ncatG` specifies the number of categories with which the gamma distribution will be discretised.
During the approximate likelihood, CODEML/BAESML will therefore use the value you set in variable `alpha` as a starting value to estimate the MLE of alpha – you will see this at the end of the `tmp*.out` file. E.g.: relevant output when running BASEML with a small dataset specifying HKY85+G4; you can see that the MLE for alpha is 0.43687:
```
Parameters (kappa) in the rate matrix (HKY85) (Yang 1994 J Mol Evol 39:105-111):
2.78171
alpha (gamma, K=4) = 0.43687
rate: 0.02237 0.20873 0.76829 3.00061
freq: 0.25000 0.25000 0.25000 0.25000
```
4. To give some context to the PAML community subscribed to this discussion group, the latest PAML release (
PAML 4.10.8, which we released a few days ago) now implements the multiplicative construction alongside the conditional construction for the birth-death process with species sampling (i.e., all details regarding how to enable one approach or the other are given in the
PAML Wiki for MCMCtre under the bullet point for variable `BDparas`).
As specified in the
PAML Wiki, both the conditional and the multiplicative constructions are incorrect:
- The component "f(tC)" appears twice in the multiplicative construction and does not follow the rules of probability calculus, which has also been reported by Heled and Drummond (2019).
- The conditional prior "f(tC|t1)" of Yang and Rannala (2006) is misspecified because labelled histories are not distinguished from rooted trees. This issue was pointed out by dos Reis (2016).
Regardless of the approach used (multiplicative or conditional), the density for
calibrated nodes is affected by truncation. What
Barba-Montoya 2018 showed is that the conditional construction leads to marginal priors (i.e., "effective priors") that are closest to the original calibration densities (i.e., "user-specified priors") when truncation is minimal. When truncation is severe, however, effective priors will differ substantially from the user-specified priors in both the multiplicative and the conditional construction. Nevertheless, while the density for
non-calibrated nodes is affected by the aforementioned mistake in the conditional approach, the multiplicative approach does not suffer from this issue. Consequently, on balance, we recommend that the multiplicative approach be used (as it generates more reasonable densities for non-calibration nodes; this is now the default option too in the control file) and that users check and confirm that the marginal densities for the "effective priors" (after the program has applied the truncation) are biologically reasonable. Consequently, running the program without data to check whether the marginal densities are reasonable must always be done before analysing the data, regardless of using the conditional or the multiplicative construction. If some of those priors look biologically unrealistic, users should adjust the calibration densities in the tree file, re-run the program, and check again. This advice would not only apply to MCMCtree, but also to any other Bayesian dating program that can be used for timetree inference.
All this information can also be found in the PAML Wiki :)
I hope the answers above somewhat clarify your questions but, if not, please let us know!
All the best,
Sandy