Time calibration issue

58 views
Skip to first unread message

victor piñon

unread,
Apr 25, 2025, 4:21:45 AMApr 25
to PAML discussion group

Hi,

I am trying to calibrate a tree using MCMCTree, following this tutorial: https://gensoft.pasteur.fr/docs/paml/4.9j/MCMCtree.Tutorials.pdf.

My alignment includes 153 amino acid gene sequences. I constructed the tree with IQTree and added the fossil calibrations as described here: https://github.com/sabifo4/Tutorial_MCMCtree/blob/main/00_data_formatting/scripts/Include_calibrations.R.

The resulting tree was:
(Gve,((Dli,(Sca,Nve)),(Ppe,((Ryu,Rfl),((Pve,(Mme,Dcy)'B(.66,1.65,1e-300,0.025)'),(Pcy,(Ssi,Ami)))))))'>4.70<5.38';)

Using this tree, I first ran usedata = 0, and everything appeared fine when checking in Tracer. I then ran usedata = 3 to estimate amino acid substitution rates and obtain the in.BV file, followed by usedata = 2 using in.BV.

The issue is that with this setup, my tree becomes incorrectly calibrated — the root age is estimated between 1300 and 1500 million years ago (Mya), which is clearly too old.

After carefully reviewing the tutorial, I noticed that the example tree:
((((human, (chimpanzee, bonobo)) '>.06<.08', gorilla), (orangutan, sumatran)) '>.12<.16', gibbon);
places the root calibration at the ingroup node, not on the outgroup node.

So I modified my tree to place the root calibration on the ingroup node instead:

(((Dli,(Sca,Nve)),(Ppe,((Ryu,Rfl),((Pve,(Mme,Dcy)'B(0.066,0.165,1e-300,0.025)'),(Pcy,(Ssi,Ami))))))'>4.70<5.38',Gve);

This time, the calibration "worked"; the ingroup was dated between 470 and 538 Mya, but the outgroup was inferred to be even older, between 650 and 720 Mya. However, this is still incorrect, as the outgroup's maximum age should be 538 Mya.

I think I’m missing something here. Could you please explain where my mistake is? I attached here the alignment and the mcmc.ctl in case the error is there.

When this works, I will change the root calibration to softbonds, but I want to learn how to properly add fossil calibrations first. 

Thanks a lot for your time.

Victor

mcmc.ctl
concatenated_blocks.phy

Sandra AC

unread,
Apr 25, 2025, 6:03:23 AMApr 25
to PAML discussion group
Hi Victor,

Thanks for your message!

I think the issue may lie in the text file you are using to specify the node age constraints as well as in your input tree file. The R script that you are using takes an input file where users need to specify the node age constraints following a specific format. E.g.: you can check the "calibrations.csv" file that I have in my tutorial, which format follows some rules (they are also written in the R script):
  • Header
  • One row per calibration
  • No spaces at all, semi-colon separated
  • There are a total of 4 columns:
    • Name of calibration (no spaces!)
    • Name of tip 1 that leads to MRCA (no spaces!)
    • Name of tip 2 that leads to MRCA (no spaces!)
    • Calibration in `MCMCtree` format. Please check the PAML documentation for more details on how to set up node age constraints!
Please note that the first tree you have shared has an additional parenthesis after the semicolon, which I believe may just be a typo (but, if not, make sure you remove it as this may be causing issues when running MCMCtree!):

```

(Gve,((Dli,(Sca,Nve)),(Ppe,((Ryu,Rfl),((Pve,(Mme,Dcy)'B(.66,1.65,1e-300,0.025)'),(Pcy,(Ssi,Ami)))))))'>4.70<5.38';)
```

Assuming that the last parenthesis is a typo, you are constraining the root age with a soft-bound calibration that has a minimum age of 4.70x100Myr and 5.38x100Myr. This means that, in the text file that you may have used to run the R script, you may have written a row such as the following (`Dli` could be any other tip as it would lead to the root age when specifying `Gve` as the other tip):

```
root;Gve;Dli;'>4.70<5.38'
```

Please note that you have also specified a root age constraint in your control file (i.e., RootAge = 5.38), but the format you have used is incorrect: you are actually fixing the root age to be 5.38x100 Myr (i.e., you are not using a calibration density). I believe you wanted to use an upper-bound calibration (?) and write '<5.38' (or 'U(5.38)') instead. As explained in the PAML Wiki under bullet point "rootAge", if users have specified a root age constraint both in the tree file and the control file, the root age constraint in the tree file shall prevail. Therefore, when you ran the analysis with the first tree, you never used an upper-bound calibration for the root age constraint: instead, you used a soft-bound calibration (i.e., 'B(4.70,5.38)' or '>4.70<5.38'). If this soft-bound calibration is supposed to be constraining another node age and the upper-bound calibration you had in your control file was indeed supposed to be used as a root age constraint, you will have to correctly identify the tips that lead to such nodes in the text file that my R script uses as aforementioned. E.g.:

```
name;tip1;tip2;MCMCtree
int_calib1;Mme;Dcy;'B(.66,1.65,1e-300,0.025)'
int_calib2;<tip1>;<tip2>; '>4.70<5.38'
root;Gve;Dli;'<5.38'
```

To clarify the confusion that may have stemmed from following the tutorial you mention, please note that there are two node age constraints specified in the input tree file, but the root age constraint is specified in the control file (i.e., RootAge = '<1.0') -- perhaps, this may have confused as you say "places the root calibration at the ingroup node, not on the outgroup node", but the root age constraint is placed right after the last parenthesis and before the ";" that finishes the Newick format; it is just only specified in the control file (this PAML tutorial was written many years ago for an older PAML version).
As recommended in the PAML Wiki and for the most recent PAML versions (i.e., >= v4.10), it is best to always have constraints specified in the tree file, including the root age constraint (see also my comments above regarding your data). To this end, the tree file that you would use in later PAML versions would be the following (while also you would need to remove variable RootAge = '<1.0'  from the control file):

```
7 1
((((human, (chimpanzee, bonobo)) '>.06<.08', gorilla), (orangutan, sumatran)) '>.12<.16', gibbon) '<1.0';   
```

I can see that you have an "in.BV" file, which means you may have generated this file by running `usedata = 3` before -- if you have done this, please note that variables `alpha_gamma` and `kappa_gamma` are not needed in the control file (they will be ignored if not needed, but always best to keep the control file clean); they are only used when the exact likelihood calculation is used (`usedata = 1`). I can see you have many alignment blocks as you have a partitioned AA dataset, and so it is important that the "in.BV" file has the blocks of branch lengths, gradient, and Hessian in the same order as you have your alignment blocks (i.e., the branch lengths, gradient, and Hessian estimated for alignment block 1 should go in the first block in the "in.BV" file, then those for the second alignment block should go after, etc.). You must also use the same tree topology you are using during timetree inference -- if you are using a different dataset and a different tree topology, you must re-estimate the branch lengths, the gradient, and the Hessian. You can find all these details and more in the PAML Wiki.

You may also want to check whether the priors you specified in the control file are sensible and have been specified based on your data. You have some suggestions about how you could find sensible priors throughout the PAML Wiki too. You may also want to check the MCMC settings to make sure that you are collecting enough samples (i.e., you have "nsample = 10000", perhaps you could increase up to 20,000 if you find that the ESS is too low?) and that the sampling frequency is high enough (so far, you are sampling every 10 iterations; perhaps you may need to increase this value if autocorrelation is too high). You can find some more details in the PAML Wiki under bullet point "burnin, sampfreq, nsample". You can check this tutorial on MCMC to get more familiar with these settings, while I suggest you read the paper "A biologist's guide to Bayesian phylogenetic analysis" (Nascimento et al. 2017).

Please let us know whether the suggestions above answer your questions and whether you can now run MCMCtree with the intended node age constraints! If you have any other questions, please let us know :)

All the best,
Sandy
Reply all
Reply to author
Forward
0 new messages