MCMCTree with multiple partitions

Κωνσταντίνος Παπαχρήστος

unread,

Dec 5, 2025, 3:39:31 AM12/5/25

to PAML discussion group

Hey there, I am using PAML 4.8a and I want to estimate time divergence of my taxa using MCMCTree. To do this I got CDS alignments from 2464 genes in 37 taxa. I calculated the Hessian matrix for each partition, creating an BV.out file from baseml and now I am using approximate likelihood estimation to get time estimates. I attach my mcmctree.ctl file so you can take a look if you need to. My problem is that I have been running the analysis for 217 hours so far, and it does not seem to progress. This is the output I have had for days now:

(long stuff about the Hessian Matrix I believe... )
...
lnL0 = -8835591303281708470749405253731826235193950208.00

Starting MCMC (np = 182372) . . .
paras: 36 times, 2464 mu, 2464 sigma2 (& rates, kappa, alpha)
finetune steps (t mu&sigma2 r mixing paras ...): 0.0500 0.0500 0.0500 0.0500 0.0500

I have run the analysis with all the data into one partition as well, just to test and it works, it just does not converge well enough. Does anyone have experience running MCMCTree with lots of data and thousands of partitions? Will this analysis ever finish? Is this normal behaviour or should I be doing something else?

I also want to point out I am running this 6 different times (I got 3 different alpha parameters for rgene_gamma: 1, 5 and 10 to get a feeling of how the estimates behave in faster or slower evolutionary rates, and for each alpha value, I run twice to check for convergence of the MCMC chains). I know we cannot use multiple cores for this, so all the above run at 1 each in my PC.

mcmctree.ctl

Sandra AC

unread,

Dec 5, 2025, 4:10:52 AM12/5/25

to PAML discussion group

Hi there,

Thanks for your message! Have you tried running the latest version of PAML instead of v4.8a? You can find the latest release with pre-compiled binaries (also the source code, should you wish to compile it on your own) on the PAML GitHub repository. In addition, there are some things in the control file you have attached that you may want to double check:

The `seed` variable is commented. You should use `seed = -1` for MCMCtree to generate a random seed number -- which you can later use if you want to reproduce the same results and/or troubleshoot issues with that specific run.
You are using `model = 8` but, as far as I am concerned, there is no such option for nucleotide data. You may want to use perhaps `model = 4` to enable model HKY85.
Option `finetune` is not needed anymore in the latest PAML version (I suggest you use the latest version instead of v4.8a).
If you use `print = 2`, please note that analyses can take much longer to finish as you are also printing the branch rates. I am not sure how long your alignment blocks are (you have `ndata = 2464`, may alignment blocks), which definitely affects how long your analyses are running for (the more partitions, the longer it takes). If you have too many alignment blocks, perhaps you should try another partitioning scheme. E.g.: rank your genes from slow- to fast-evolving and then separate them into four alignment blocks with the same number of genes each -- 4 or 5 alignment blocks seem to be a good compromise between computational speed and precision in time estimates, see Álvarez-Carretero et al. 2022, but also dos Reis et al. 2012, dos Reis et al. 2014, and dos Reis et al. 2015.
You also have a very long `burnin`, which is alright if you know that it takes a long time for your chain to finetune itself and find its way through the parameter space. Perhaps you could decrease it a bit and, if you then run MCMC diagnostics and see that your `mcmc.txt` file still has some samples that should be considered as part of the burn-in, you can then remove them.

Just to verify that your settings are alright and you can complete one successful run, I suggest you run MCMCtree first with just one of your alignment blocks (you can also reduce the iterations for the burn-in phase and the sampling frequency as you just want to run this as a test, to make sure that a run can complete). Include variable `seed`, choose a valid model for nucleotide data (variable `model`), remove `finetune` when using the latest version of PAML, and check how long MCMCtree takes to run with `print = 1`. If the analysis finishes successfully, then you can update variables `burnin` and `sampfreq` to sensible numbers based on your test and you may want to run again, this time with `print = 2` so that you can estimate how long it takes to also print the branch rates with one alignment block and sensible options for `burnin` and `sampfreq`. Based on these tests, perhaps you can have a rough idea of how long things will take when you include all alignment blocks and whether it would be best to have a different partitioning strategy such as the one I mentioned above :)

Hope this somewhat helps!

All the best,
Sandy

Κωνσταντίνος Παπαχρήστος

unread,

Dec 10, 2025, 6:08:02 AM12/10/25

to PAML discussion group

Hey Sandy,

Thank you for your feedback! Some points:

Before you responded and after I posted my question, I ran the analysis just as described before, but with 10 genes instead of 2464 and it worked. Completed after around 9h. So this tells me I may have too many partitions for the program to handle. If that worked linearly, 2464 partitions would run in 93 days I suppose!?
In the new tests, I run as you suggested with PAML v4.10.9. I am using print = 2, model = 4 and 5 partitions of slow to fast genes as you suggested in your paper. It runs now 5% per 30 mins, so your suggestion on partitioning really helped!
When first trying PAML v4.10.9 without the finetune line in the ctl file, I got a "Error: steplength = 0 in UpdateTimes" error and exit. When I put the "finetune" line back in the ctl file it worked. So I guess there is something there that one needs to set, I am not sure what and how though. Did not look more into that.
For the model = 8, I was going for one of the GTR models, specifically UNREST, where you have each substitution and having its own probability in the Q matrix (and the reverse substitutions having also a different rate). I just thought a more general model would be better to use. In the manual for PAML in the MCMCTree section, page 48, it says,

"model, alpha, ncatG are used to specify the nucleotide substitution model. These are the same variables as used in baseml.ctl. If alpha ≠ 0, the program will assume a gamma-rates model, while alpha = 0 means that the model of one rate for all sites will be used. Those variables have no effect when usedata = 2".

So I thought I could be using model = 8 for UNREST from baseml for this. The problem is that just to check, I made a run with model = 99, and MCMCTree did not exit/crash. So it seems it can accept invalid values for that, and runs with some default silently (PAML v4.10.9)? Do you know how I can see what actual model MCMCtree runs, because from the output files produced it is not clear to me where this information is?

Cheers,

Kostas

Sandra AC

unread,

Dec 10, 2025, 6:34:41 AM12/10/25

to PAML discussion group

Hi Kostas,

Thanks, great to know that you have something to work with now with the new partitioning scheme! Answering to your last questions:

Based on your tests, perhaps partitioning per gene may take too long to run, so perhaps having 4 to 5 partitions is a good compromise between computational time/resources and accuracy in time estimates. If you have access to a server, you can run many independent chains at the same time if you submit a "job array", which I believe could be more efficient than running individual analyses on your personal computer. You can find a description I wrote some time ago about how to understand and write a bash script to enable a job array via this link and a script that you could use as template and customise so that it fits your analyses (e.g., adapt to the PAML version you are using and your paths, change the flags if you use SLURM as the ones I wrote there are based on an SGE scheduler, etc.).
I am not sure why you are getting an error with the "finetune" variable -- perhaps something to do with the format of the node age constraints you have set in your tree file? I do not know. If you want further troubleshooting with this issue, please attach the control file, input sequence file, and tree file you used when you got this error.
If you estimated the gradient, Hessian, and branch lengths under the UNREST model with BASEML to obtain the so-called "in.BV" file, then that should be fine as variables "model", "alpha", and "ncatG" in the control file do not have an effect when "usedata = 2" -- I think I missed this last time because you did not rename the "out.BV" to "in.BV", but that is OK :)