Speeding up mcmctree


Tiago Simoes

Aug 18, 2023, 1:25:08 PM
to PAML discussion group

Hi everyone,

My understanding is that there is still no parallelised version of PAML, correct? I have also seen no improvement in computation time from increasing the requested RAM on my cluster. Any tips on how to reduce computation time on an HPC for genome-scale nucleotide data (besides, of course, using the approximate likelihood calculation with baseml/mcmctree)? At this point it basically runs at the same speed as on my personal laptop.

Regards,

Tiago 

Sishuo Wang

Aug 24, 2023, 1:24:18 AM
to PAML discussion group
Hi Tiago,

At the software level, what about https://www.mdpi.com/2073-4425/13/6/1090? I am not familiar with it, though.

sishuo

Tiago Simoes

Aug 24, 2023, 9:03:33 AM
to PAML discussion group
Thanks, Sishuo. It works for codeml functionalities only, but it's interesting!

Sandra AC

Aug 25, 2023, 3:52:19 AM
to PAML discussion group
Hi Tiago, 

I suggest you use the approximate likelihood calculation implemented in MCMCtree (dos Reis and Yang, 2011) if you want to speed up your analyses. You can read the book chapter Mario dos Reis and Ziheng Yang wrote some years ago (dos Reis and Yang 2019), which goes together with the GitHub repository "divtime".
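For reference, the approximate likelihood workflow is driven by the `usedata` option in the MCMCtree control file. A minimal sketch of the two-step procedure (file names `out.BV`/`in.BV` are the PAML defaults; the rest of the control file is omitted):

```
* Step 1: compute the gradient and Hessian (MCMCtree calls BASEML/CODEML)
usedata = 3        * writes out.BV

* Step 2: rename out.BV to in.BV, then run the MCMC with approximate likelihood
usedata = 2        * reads in.BV
```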

If you are to run multiple independent chains on an HPC, you should definitely customise your bash scripts so that they run as a so-called "job array" (i.e., the same task is run on different nodes of the cluster "in parallel", which is quicker than using a "for" loop). You may need to modify the flags in your bash scripts depending on whether your HPC uses a SLURM or SGE scheduler. As every cluster is different, you may want to contact those in charge of managing and maintaining the HPC you are using for assistance with that (e.g., memory allocation, node usage, etc.). You will also need to think of a file structure that works nicely with your environment and dataset. If you want some ideas, you can read a general workflow to set up job arrays that I wrote some time ago. While it focuses on CODEML, you can adapt the template script I provide in that section to run a job array for any other task with any other program.
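To make the job-array idea concrete, here is a minimal SLURM template, assuming a hypothetical layout with directories `run_1/` ... `run_16/` (one independent chain each, each containing its own `mcmctree.ctl`); the resource values are placeholders to tune for your own cluster:

```bash
#!/bin/bash
#SBATCH --job-name=mcmctree_chains
#SBATCH --array=1-16            # one array task per independent chain
#SBATCH --cpus-per-task=1       # MCMCtree runs on a single core
#SBATCH --mem=4G                # placeholder; adjust after a test run
#SBATCH --time=72:00:00

# Each array task picks its own directory via SLURM_ARRAY_TASK_ID
cd "run_${SLURM_ARRAY_TASK_ID}" || exit 1
mcmctree mcmctree.ctl > "mcmctree_${SLURM_ARRAY_TASK_ID}.log" 2>&1
```

Submit with `sbatch script.sh`. On an SGE cluster the equivalent would be the `#$ -t 1-16` directive and the `SGE_TASK_ID` variable.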

Cheers!
S.

Tiago Simoes

Aug 25, 2023, 9:22:56 AM
to PAML discussion group
Dear Sandra,

Thank you for your feedback and references. I am familiar with the approximate likelihood approach and job arrays, so it seems that the only other way to speed things up is to split the data into multiple partitions as a way to "parallelise" things.

Thanks again,

Tiago 

Sandra AC

Aug 25, 2023, 11:41:52 AM
to PAML discussion group
Unfortunately, MCMCtree can currently only run as a single-core job and, as you have already mentioned, it cannot be parallelised -- I had missed the part where you mentioned you had already tried the approximate likelihood, sorry! Increasing the RAM will not speed things up with MCMCtree either, although you may need to increase it when running CODEML/BASEML for the gradient and Hessian calculation with large datasets.
As an alternative, you may want to use the Bayesian Sequential-Subtree (BSS) approach, which combines the usage of the approximate likelihood with a "backbone tree + subtrees" method under a Bayesian framework to speed up clock-dating analyses with MCMCtree. You can find a step-by-step tutorial on my GitHub repository, which complements our paper. While we applied the method to infer a mammal evolutionary timeline, the method can be used with other datasets :)

Nevertheless, I am not sure what you mean by "split the data into multiple partitions as a way to 'parallelise' things". If you partition your molecular alignment (e.g., by codon position, slow- to fast-evolving genes, different data types, etc.) and include each partition as an "alignment block" in a single alignment file (i.e., `ndata = X` in the control file, where `X` is the number of alignment blocks in the alignment file you provide via option `seqfile`), MCMCtree will take longer to run.

If, instead, you analyse each alignment block individually with MCMCtree as a separate data subset, MCMCtree will indeed be faster because you will have divided your main dataset into X data subsets, but you will infer time estimates for each data subset separately, and you should not average those estimates: the results correspond to analyses of different data subsets. E.g., say you divide your main molecular alignment into two data subsets by randomly splitting your sequences into two halves. You would then infer the divergence times for each data subset separately while fixing the same tree topology (unless you want to evaluate different tree hypotheses), which would give you divergence time estimates for the first data subset and then for the second, but you should not average those. I just wanted to clarify this in case other PAML users read this thread :)
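To illustrate the difference, the first scenario (one joint analysis of all blocks) corresponds to a control file along these lines (file names are hypothetical):

```
seqfile  = alignment_2blocks.phy  * one file containing 2 alignment blocks
treefile = species.trees
ndata    = 2                      * number of alignment blocks in seqfile
```

In the second scenario you would instead prepare two separate alignment files, each analysed with `ndata = 1` in its own MCMCtree run (e.g., as two tasks of a job array), keeping the resulting time estimates separate.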

Good luck with your inference analyses!
S.

Tiago Simoes

Aug 25, 2023, 2:02:24 PM
to PAML discussion group
Thanks again, Sandra!

For codeml/baseml, good to know that increasing memory will speed things up for these! BTW, I know this is a bit off the original question, but how is it determined how long baseml analyses will take to finish running? There are multiple rounds being written to the output, but it's not clear to me what the total number of rounds should be, since we do not set the number of iterations in the baseml.ctl file.

Cheers!

Tiago

Sandra AC

Aug 28, 2023, 4:15:38 AM
to PAML discussion group
Hi Tiago, 

What I meant is that you may need to increase the RAM because CODEML/BASEML need to allocate more memory to calculate the gradient and the Hessian for large genomic datasets -- sometimes users get a "Bus error" or similar because the allocated RAM is not enough for the size of the dataset being analysed, but this does not mean that there is a problem (users just need to increase the allocated RAM). Increasing the RAM, however, will not speed things up. Users may submit a "test run" with, e.g., 1 GB or 2 GB of RAM and then, if the analysis is terminated due to lack of RAM, use the log file and summary stats produced by the job to extrapolate how much RAM may be needed -- sometimes it ends up being a trial-and-error process. I would always encourage users to get help from research software engineers or other researchers in charge of managing/maintaining the HPC they use to ensure good practice before submitting any jobs :)
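As one concrete way to do that extrapolation, assuming your cluster uses SLURM with accounting enabled, you can query the peak memory of a finished test job:

```bash
# Replace <jobid> with the ID of your finished test run
sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed,State
```

`MaxRSS` reports the job's peak resident memory, which you can scale up when requesting RAM for the full dataset. On SGE, `qacct -j <jobid>` gives similar usage information.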

I am not sure how you can estimate in advance how many rounds it will take CODEML or BASEML to finish. I may be wrong but, given that they are ML-based programs, I believe they keep running one iteration after another until the algorithm converges. I would guess that the more "complex" the dataset, the longer the software takes to finish the calculations and reach convergence... You will find more details about `method = 0` and `method = 1`, the information printed in the `rub` output file and how it can help debug possible convergence problems, and how to specify initial values in the PAML documentation under the "Miscellaneous notes" section. Hope this somehow helps!

All the best,
Sandra

Tiago Simoes

Aug 28, 2023, 10:27:23 AM
to PAML discussion group
Hello Sandra,

Thank you for the extra information and clarifications.

Kind regards,

Tiago