Query: MCMCTree Running Issues: Multithreading Support and Interrupted Run Recovery

guo

Jun 13, 2024, 5:13:52 AM
to PAML discussion group

Hello!

I am currently using MCMCTree for Bayesian evolutionary analysis and have encountered two issues. I hope to get some advice and solutions from the community.

1. Multithreading Support and Speed Optimization

Does MCMCTree support multithreading? If not, are there any other methods to significantly speed up the runtime? Currently, I am using it in single-thread mode, but the analysis is quite slow due to the large dataset. Any optimization tips or alternative solutions would be greatly appreciated.

2. Interrupted Run Recovery

After running MCMCTree for a month and a half, there was a power outage, and the program was forcibly stopped. Is there a way to resume the run from where it was interrupted, or do I have to start over from the beginning? If it's possible to resume, could you please provide detailed steps or commands on how to do this?
Thank you all for your time and assistance!

Li Guo

Ziheng

Jun 13, 2024, 5:58:21 AM
to PAML discussion group
1. there is no multithreaded version of paml.  this is work in progress and we have nothing to offer at this stage.

2. print = -1 instructs the program to summarize an existing mcmc sample instead of running the mcmc. you can copy the files to a different folder, edit the control file to use print = -1, and then run mcmctree. the program will read the mcmc sample and summarize it. you can also load the sample in tracer to assess convergence. if you seem to have enough samples (for example, if no ESS is too small, say all are >100 or >200), there is no need to run a longer chain. note that the number of samples you specified in the control file may have been too small or too large anyway.
you can search the doc for checkpointing, but i think that if checkpointing was not used, you won't be able to restart the mcmc from where it stopped.
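
The copy-and-edit step described above can be sketched in shell. This is only an illustration under stated assumptions: the function name `prepare_summary_run` and the folder names in the usage comment are hypothetical, and the control file is assumed to contain a line of the form `print = <number>`. The original run folder is left untouched.

```shell
#!/bin/sh
# Sketch: copy an interrupted run to a new folder and switch the control
# file to print = -1 so that mcmctree summarises the existing sample.
# The function name and its arguments are illustrative, not part of PAML.
prepare_summary_run() {
    src_dir=$1   # folder of the interrupted run
    dst_dir=$2   # new folder for the summary run
    ctl=$3       # control file name, e.g. mcmctree.ctl

    mkdir -p "$dst_dir" || return 1

    # Copy every regular file except the control file, which is rewritten below.
    for f in "$src_dir"/*; do
        [ -f "$f" ] || continue
        [ "$(basename "$f")" = "$ctl" ] && continue
        cp "$f" "$dst_dir"/ || return 1
    done

    # Replace the value on the "print = ..." line with -1; keep everything else.
    sed -E 's/^([[:space:]]*print[[:space:]]*=[[:space:]]*)-?[0-9]+/\1-1/' \
        "$src_dir/$ctl" > "$dst_dir/$ctl" || return 1

    # Fail if the control file had no print line to edit.
    grep -Eq '^[[:space:]]*print[[:space:]]*=[[:space:]]*-1' "$dst_dir/$ctl"
}

# Example (hypothetical folder names; not run here):
#   prepare_summary_run run1 run1_summary mcmctree.ctl
#   cd run1_summary && mcmctree mcmctree.ctl
```

Working on a copy means the partial `mcmc.txt` from the interrupted run is never overwritten by accident.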
ziheng  yang

guo

Jul 9, 2024, 9:49:36 AM
to PAML discussion group

Hello!

Thank you very much for your response. I apologize for the delay in implementing your suggestions due to some server maintenance issues. Here are the steps I followed and some preliminary results:

First, I attempted to set the print parameter to -1 and ran mcmctree. I then checked the FigTree.tre file that was generated, but the results were not satisfactory, as shown in the attached image.

tre.png

Second, I opened the “mcmc.txt” file in Tracer. The results indicated that while the ESS values for most parameters were well above 200, some were still below 100. In this case, should I consider lowering the “nsample” value in my control file? Or do I need to ensure that all ESS values are above 200?

tracer.png

Lastly, I studied the MCMCTree tutorials but couldn’t find any information on “checkpointing.” Does this refer to the fossil calibration points I used? Do I need to provide more input or output files to determine where I might have made a mistake?

I couldn't see my reply on the forum, so you might receive duplicate posts from me. If this happens, I apologize for any inconvenience.

Thank you again for your response and assistance, and I apologize for the delay in executing your suggestions.

Li Guo

Sandra AC

Jul 9, 2024, 10:19:44 AM
to PAML discussion group
Hi Guo,

Thanks for letting us know! Please find below some suggestions to follow which, hopefully, may answer your questions:

1. If you are using `print = -1`, you may need to have the sequences in your alignment file in the same order the corresponding taxa appear in your Newick tree file from left to right. Can you try to order the sequences in your alignment file in such a way and let us know if that works? If not, can you please attach your `mcmc.txt` file, your tree file, your alignment file, and your control file to help us troubleshoot what may have happened? E.g., I believe the following is happening:

Alignment file:

```
5  <num_bp>
sp3  ATGGC...
sp1  ATGGC...
sp5  ATGCC...
sp2  ATGGT...
sp4  ATGGT...
```

Tree file:

```
5 1
((sp1,sp2),(sp3,(sp4,sp5)));
```

Try to arrange the sequences in the alignment file as follows and use `print = -1` again:

```
5  <num_bp>
sp1  ATGGC...
sp2  ATGGT...
sp3  ATGGC...
sp4  ATGGT...
sp5  ATGCC...
```

2. In your case, you have only run one chain (or you have only loaded one chain in Tracer, as per your screenshot). You should run at least two chains so that you can check whether they have reached the same target distribution, and you may need more: if the posterior distributions differ between two chains, you will not know which one is closer to your target distribution or which one may have had problems. Try running several independent chains and see whether the ESS for those parameters increases (e.g., start with four chains, then increase the number if convergence has not yet been reached). If you let us know what MCMC settings you have in your control file and the model under which you are running your analyses, it will be easier for us to help in case something else goes wrong.

3. You may have missed the most important document: the PAML documentation :) If you read the pages relevant to MCMCtree, you will see that the `checkpointing` option is documented (at the time of writing, page 44). Let us know whether `checkpointing` works once you follow the settings described there. If it does not, please tell us what went wrong and which settings and input files you were using so that we can help troubleshoot the issue.
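
For suggestion 1, reordering hundreds of sequences by hand is impractical, so here is a minimal shell sketch of one way to do it. It rests on several assumptions: the alignment is in sequential PHYLIP format with the name and the whole sequence on one line (as in the example above), taxon names contain no whitespace, and any calibrations in the tree file are quoted. The function names `tip_order` and `reorder_alignment` are hypothetical and not part of PAML.

```shell
#!/bin/sh
# Print the tip names of a Newick tree, one per line, left to right.
# Quoted calibration labels are removed first; a tip name is whatever
# follows "(" or "," up to the next "(", ")", ",", ":" or ";".
tip_order() {
    sed -e "s/'[^']*'//g" -e 's/"[^"]*"//g' "$1" |
        tr -d ' \t\r\n' |
        grep -oE '[(,][^(),:;]+' |
        cut -c2-
}

# Rewrite a one-line-per-taxon PHYLIP alignment in tree tip order.
# Usage: reorder_alignment tree_file alignment_file > reordered.phy
reorder_alignment() {
    tip_order "$1" | awk '
        NR == FNR { order[++n] = $0; next }   # stdin: tip names in tree order
        FNR == 1  { header = $0; next }       # alignment header line
        NF        { line[$1] = $0 }           # index sequence lines by name
        END {
            # Report every tip that has no sequence before writing anything.
            for (i = 1; i <= n; i++)
                if (!(order[i] in line)) {
                    print "not in alignment: " order[i] > "/dev/stderr"
                    bad = 1
                }
            if (bad) exit 1
            print header
            for (i = 1; i <= n; i++) print line[order[i]]
        }
    ' - "$2"
}
```

Interleaved alignments, or files with several gene partitions stacked one after another, would need to be split per partition first; this sketch does not handle them.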

Hope these suggestions help, please let us know if the issues are resolved!

All the best,
Sandra

guo

Jul 10, 2024, 8:59:11 AM
to PAML discussion group

Dear Sandra,

Thank you very much for your response. I have carefully read your reply and the PAML documentation you provided, which has been very helpful. However, I have a few questions that need your assistance:

  1. I indeed have an issue with the alignment file (mcmcgene.phy) and the tree file (FigTree.tre) not being in the same order. Additionally, I noticed an example of the concatenated matrix format on page 13 of the PAML documentation. I suspect there might be an error in the format of my sequence file. In fact, the "calibration.tree" in my control file was constructed in RAxML from a concatenated matrix of 11 genes, but when running the mcmctree program, I split it into 11 gene matrices instead of using a concatenated matrix. Could this be the issue?

    Considering your suggestion to reorder the sequences, I have 765 species, so this is not a simple task. However, if it is necessary, I will attempt this once I ensure my alignment file is correct.

  2. Due to a power outage, it is possible that none of my chains have completed their run.

  3. I found the documentation on “checkpointing,” but unfortunately, I did not use this parameter in my current run. I will include this parameter in my next run and will report back on the results.

Attached are some files that might help you identify my mistakes. My “mcmc.txt” file is too large, and even when compressed (35M), it exceeds the website's maximum limit of 24M. Would it be possible to send it to your personal email instead?

Thank you again for your response and assistance.


Li Guo

mcmcgene.phy
calibration.tree
mcmctree.ctl

Sandra AC

Jul 13, 2024, 1:48:37 PM
to PAML discussion group
Hi there,

Thanks for sharing your input files, I have now had some time to troubleshoot what may have happened:

1. I noticed that you had defined the root age constraint twice: once in the control file (i.e., '<1.0') and once in your calibrated tree file (i.e., '>.6198<.9891' ). The usage of the `RootAge` option in the control file is discouraged, so it is always best to include your root age constraint in the input tree file. The first thing you should do is therefore remove this line from the control file.

I have also seen that you are using the exact likelihood calculation (`usedata = 1`) instead of the approximate likelihood calculation (see dos Reis and Yang, 2011). As you have a large alignment, I highly encourage you to use the approximate likelihood calculation; otherwise, your analysis may take much longer to finish. There are various tutorials that explain how to enable this feature for timetree inference. I recommend you read the tutorial "Bayesian Molecular Clock Dating Using Genome-Scale Datasets" (dos Reis and Yang, 2022) to understand the steps you need to follow to analyse your dataset much faster with MCMCtree. You can also access the GitHub repository `divtime`, maintained by Mario dos Reis, if you want to use the example datasets that are described in the tutorial while you read it.

Lastly, if you have the `mcmc.txt` file with the samples collected so far, you can generate what we call a "dummy" alignment with the species ordered in the same way they appear in the tree and only a few characters (e.g., two nucleotides). When you enable `print = -1`, MCMCtree will check that the taxa names in your input sequence file are also present in your input tree file. It does not matter how many characters there are in this dummy alignment because divergence times will not be estimated under option `print = -1`.

If the taxa names match, MCMCtree will then read the samples that were collected during the MCMC and saved in the `mcmc.txt` file (if your file has a different name, e.g., "Rutaceaemcmc.txt", make sure you pass the correct name to option `mcmcfile`), summarise them, and match the mean values to the corresponding nodes in your tree file. I have done a test with a dummy alignment and a dummy "mcmc.txt" file that I quickly generated and it works: I got a `FigTree.tre` output file with the correct topology. There are many ways you can create such a dummy alignment, but you can find below how I quickly did this with bash scripting:

```
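# Run from the directory where you have your `calibration.tree` file.
# Extract the "Genus_species" names in tree order and append a
# two-character dummy sequence to each. Writing with `>` rather than
# `>>` avoids duplicated entries if the script is run again.
grep -o '[A-Z][a-z]*_[a-z]*' calibration.tree | awk '{ print $1 "    AT" }' > dummy_aln.phy
# The pattern above stops at hyphens, so restore the hyphenated names.
sed -i 's/Clausena_anisum/Clausena_anisum-olens/' dummy_aln.phy
sed -i 's/Haplophyllum_alberti/Haplophyllum_alberti-regelii/' dummy_aln.phy
sed -i 's/Melicope_lunu/Melicope_lunu-ankenda/' dummy_aln.phy
sed -i 's/Zanthoxylum_clava/Zanthoxylum_clava-herculis/' dummy_aln.phy
# Prepend the PHYLIP header: number of taxa and number of characters.
sed -i '1i 767 2' dummy_aln.phy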
```
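
A more general variant of the same idea, for trees whose taxon names do not all fit one pattern, is sketched below. It is an illustration rather than the script used for the attached files: the function name `make_dummy_alignment` is hypothetical, and it assumes that calibrations in the tree file are quoted and that taxon names contain no whitespace. It takes each tip name as the text following `(` or `,`, so hyphenated names need no manual fixes, and it counts the taxa for the header itself.

```shell
#!/bin/sh
# Sketch: build a two-character dummy alignment from the tips of a
# Newick tree, in tree order. The function name is illustrative only.
# Usage: make_dummy_alignment calibration.tree > dummy_aln.phy
make_dummy_alignment() {
    # Drop quoted calibration labels, then all whitespace, then keep
    # whatever follows "(" or "," up to the next "(", ")", ",", ":" or ";".
    sed -e "s/'[^']*'//g" -e 's/"[^"]*"//g' "$1" |
        tr -d ' \t\r\n' |
        grep -oE '[(,][^(),:;]+' |
        cut -c2- |
        awk '{ name[NR] = $0 }
             END {
                 print NR, 2                       # header: taxa, characters
                 for (i = 1; i <= NR; i++) print name[i] "    AT"
             }'
}
```

Comparing the first line of the output with the number of species in the tree file header is a quick check that no name was missed or split.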

I have attached this dummy alignment, the dummy "mcmc.txt" file I generated, your tree file, and the control file I used for my tests, in case this helps you run your own tests and understand how to format your input data. Please note that I have low values for `burnin`, `nsample`, and `sampfreq` in the attached control file as I wanted to quickly troubleshoot this problem. When you run MCMCtree for timetree inference with your dataset, please do not use these settings. You can keep those you had, although perhaps you may want to increase your sampling frequency a bit as 10 is quite low.

2-3. That's a shame, I am very sorry to read that... :( I hope that your next runs have the `checkpointing` option enabled so that, if something similar were to happen, you would be able to restart the chains from the last saved checkpoint.

Hope this helps!

All the best,
Sandra
mcmctree.ctl
dummy_aln.phy
mcmc.txt