Is BEAST capable of handling this phylogenomic matrix?

David Ortiz

Mar 10, 2024, 2:53:51 PM
to beast-users
Hello:

I am trying to run BEAST 1.10.4 on an unpartitioned matrix with 99 taxa and nearly 130 thousand DNA positions.
I am using an almost fully fixed phylogeny (which noticeably reduces the parameter space to explore), an uncorrelated clock model, and 3 calibration points on the deepest nodes, with normally distributed priors.

I am running 50 million generations, sampling every 5000, in 6 independent replicates with a single chain each.

I have tried both a single GTR+I+G model and a single HKY+I+G model for the whole matrix, the latter to reduce the parameter space.

The analyses run reasonably fast (~3 hours per 1 million generations).
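For reference, the run budget implied by these settings can be worked out directly. This is a minimal sketch that uses only the numbers quoted above; the variable names are illustrative and are not BEAST settings.

```python
# Back-of-the-envelope run budget from the settings quoted in this post.
chain_length = 50_000_000    # generations per replicate
sample_every = 5_000         # logging interval (generations)
hours_per_million = 3.0      # observed speed: ~3 hours per 1 million generations

# Number of logged states per replicate.
samples_per_run = chain_length // sample_every

# Wall-clock time for one full replicate.
hours_per_run = chain_length / 1_000_000 * hours_per_million
days_per_run = hours_per_run / 24

print(samples_per_run)   # 10000
print(hours_per_run)     # 150.0
print(days_per_run)      # 6.25
```

Each replicate therefore takes a little over six days, which is why every additional test of a model or operator setting is expensive.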

However, after 30 million generations, I have not achieved convergence among runs, or even mixing within runs, with either the GTR+I+G or the HKY+I+G model: ESS values are very low, and many parameters (the MRCAs of all the constrained nodes) show sharp downward or upward trends.
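To make "very low ESS" concrete: the effective sample size discounts the number of logged states by their autocorrelation. The sketch below is a simplified textbook estimator (N divided by one plus twice the sum of the leading positive autocorrelations); it is an assumption for illustration and is not the algorithm Tracer uses. The function name and the synthetic traces are hypothetical.

```python
import random

def effective_sample_size(trace):
    """Rough ESS estimate: N / (1 + 2 * sum of leading positive autocorrelations).

    Simplified illustration only; this is not Tracer's implementation.
    """
    n = len(trace)
    mean = sum(trace) / n
    centered = [v - mean for v in trace]
    var = sum(c * c for c in centered) / n
    if var == 0.0:
        return float("nan")  # a constant trace carries no information
    rho_sum = 0.0
    for lag in range(1, n):
        cov = sum(centered[i] * centered[i + lag] for i in range(n - lag)) / n
        rho = cov / var
        if rho <= 0.0:  # truncate at the first non-positive autocorrelation
            break
        rho_sum += rho
    return n / (1.0 + 2.0 * rho_sum)

# Synthetic traces: independent draws versus a sticky, slowly mixing chain.
random.seed(1)
well_mixed = [random.gauss(0.0, 1.0) for _ in range(2000)]
sticky, x = [], 0.0
for _ in range(2000):
    x = 0.95 * x + random.gauss(0.0, 1.0)
    sticky.append(x)

print(effective_sample_size(well_mixed))  # close to the number of samples
print(effective_sample_size(sticky))      # far smaller than the number of samples
```

A trace with a sustained trend behaves like the sticky chain: neighbouring samples are nearly identical, so thousands of logged states may amount to only a handful of independent ones.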

So, at this point, I can think of a couple more things to try: 1) adding a heated chain to each run to improve mixing, or 2) "dangerously" trying an even simpler model of evolution (K80+I+G, F81+I+G, ...).

These tests take a lot of time, and I am not even sure whether BEAST can handle matrices this large, or whether I am forced to migrate to something like MCMCTree.

I would appreciate anyone's experience with this and/or other suggestions.

Cheers,

DAVID

Alexei Drummond

Mar 10, 2024, 3:51:04 PM
to beast...@googlegroups.com
This is probably because of the very strong correlations between branch rates and divergence times that arise when sequences get very long. The BEAST2 ORC (optimised relaxed clock) package performs very well under these conditions and might be worth a try.

Side note: I generally find that fixing the topology doesn’t speed things up much, so I am not sure it is worth it.

Cheers
Alexei

Farman, Mark L.

Mar 10, 2024, 4:23:42 PM
to beast...@googlegroups.com
Maybe you don’t have any useful phylogenetic signal. 

Mark L. Farman 
Professor, Department of Plant Pathology

Alexei Drummond

Mar 10, 2024, 4:33:14 PM
to beast-users
The phenomenon he is describing is unlikely to be caused by lack of phylogenetic signal in my experience. It is well known that even perfect data on branch lengths can’t identify divergence times and branch rates without prior information, so in that trivial sense all data sets lack the required phylogenetic signal (sans external information about either divergence times or branch rates).
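The confounding of rates and times described above can be shown with a toy calculation. This is a minimal sketch assuming a Jukes-Cantor (JC69) model and just two sequences that diverged `time` units ago; the function name and the count of differing sites are hypothetical, and only the 130,000-site alignment length comes from the thread. The likelihood depends on rate and time only through their product, so rescaling one against the other leaves it unchanged.

```python
import math

def jc69_pair_loglik(n_sites, n_diff, rate, time):
    """Log-likelihood of two aligned sequences under JC69.

    The two lineages each evolve for `time` at `rate`, so the expected
    number of substitutions per site between them is d = 2 * rate * time.
    """
    d = 2.0 * rate * time
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 * (0.25 + 0.75 * e)   # probability of one specific identical site pattern
    p_diff = 0.25 * (0.25 - 0.25 * e)   # probability of one specific differing site pattern
    return (n_sites - n_diff) * math.log(p_same) + n_diff * math.log(p_diff)

n_sites, n_diff = 130_000, 13_000   # n_diff is a made-up count for illustration

a = jc69_pair_loglik(n_sites, n_diff, rate=0.001, time=50.0)
b = jc69_pair_loglik(n_sites, n_diff, rate=0.002, time=25.0)   # same rate * time
c = jc69_pair_loglik(n_sites, n_diff, rate=0.001, time=100.0)  # different rate * time

print(math.isclose(a, b))   # the data cannot tell these two apart
print(a - c)                # moving off the rate * time product changes the likelihood
```

Only prior information, such as node calibrations or a prior on the rate, can separate the two combinations that the data treat as equivalent.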

It sounds much more likely that there are multiple modes in divergence times versus branch rates (expected) but that the ridge in parameter space along these almost equally probable parameter combinations is narrow because of the large alignment (also expected). I would not be surprised if this is accentuated by restricting the topology, as there might be intermediates ruled out a priori, which makes mixing harder for the operators. A couple of papers have been written recently about operators that try to address this problem:

Zhang, R., Drummond, A. Improving the performance of Bayesian phylogenetic inference under relaxed clock models. BMC Evol Biol 20, 54 (2020). https://doi.org/10.1186/s12862-020-01609-4

Douglas J, Zhang R, Bouckaert R (2021) Adaptive dating and fast proposals: Revisiting the phylogenetic relaxed clock model. PLoS Comput Biol 17(2): e1008322. https://doi.org/10.1371/journal.pcbi.1008322

I am sure there are others but these are the papers I am familiar with in the BEAST2 ecosystem.

Cheers
Alexei

Farman, Mark L.

Mar 10, 2024, 4:56:17 PM
to beast...@googlegroups.com
Just making the comment because ALL datasets I’ve looked at recently (from multiple groups) lack phylogenetic signal. 

Mark L. Farman 
Professor, Department of Plant Pathology

Farman, Mark L.

Mar 10, 2024, 5:07:29 PM
to beast...@googlegroups.com
A third-party dataset I am working on right now is behaving in a similar way. However, this dataset has no true phylogenetic signal, and I am only testing whether I can get convergence because I am trying to ascertain how far off “truth” the resulting MRCAs will be.

Mark L. Farman 
Professor, Department of Plant Pathology

David Ortiz

Mar 10, 2024, 5:55:04 PM
to beast-users
Dear Alexei and Mark:

Thanks a lot for your comments and suggestions.

Since yesterday, I have been running a new trial after increasing the number of gamma categories in the HKY+I+G model from 4 to 7, to capture more of the rate heterogeneity across such a large matrix.

If I still get poor mixing, I will try the BEAST2 ORC package, as Alexei suggests, and/or remove the topological constraints.

By the way, there is a lot of phylogenetic signal in the matrix. I fixed nearly all nodes because the concatenated (IQ-TREE and ExaBayes) and coalescent (ASTRAL) analyses I ran previously gave almost identical topologies with high branch support. I thought this would noticeably simplify the searches, but I now see from Alexei's comment that this is probably not the case and that it might even be counterproductive.

Cheers,

DAVID

Gaspary Eugene

Mar 11, 2024, 3:01:47 AM
to beast...@googlegroups.com
Have you tried optimizing the operator weights, particularly for tree height and clock rate? These parameters are normally strongly negatively correlated; you can check this in Tracer with the joint-marginal plot (tree height vs. clock rate). If that correlation does not show up, you can try increasing the weight of the clock rate operator in the Operators panel. A higher clock rate means that less time is needed for substitutions to accumulate along the branches, so the branches can be shorter.
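The same check can be done numerically on the log file. This is a minimal sketch assuming a tab-delimited BEAST log with `#` comment lines; the column names `treeModel.rootHeight` and `ucld.mean` are assumptions and should be replaced with whatever names appear in your own log header. The helper names and the four-row log are hypothetical and synthetic.

```python
import csv
import io
import math

def read_beast_log(handle):
    """Parse a tab-delimited log into {column name: list of floats}, skipping '#' lines."""
    rows = (line for line in handle if line.strip() and not line.startswith("#"))
    reader = csv.DictReader(rows, delimiter="\t")
    # Ignore empty column names, which appear when lines end with a trailing tab.
    columns = {name: [] for name in reader.fieldnames if name}
    for row in reader:
        for name in columns:
            columns[name].append(float(row[name]))
    return columns

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Synthetic stand-in for a real log file (values invented for illustration).
demo_log = io.StringIO(
    "# synthetic example, not real BEAST output\n"
    "state\ttreeModel.rootHeight\tucld.mean\n"
    "0\t100.0\t0.0010\n"
    "5000\t110.0\t0.0009\n"
    "10000\t125.0\t0.0008\n"
    "15000\t90.0\t0.0011\n"
)
cols = read_beast_log(demo_log)
r = pearson(cols["treeModel.rootHeight"], cols["ucld.mean"])
print(r)  # strongly negative for this synthetic example
```

For a real run, replace `demo_log` with `open("your_run.log")`, where the file name is a placeholder, and discard the burn-in states before computing the correlation.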

gaspary

Julian Tang

Mar 11, 2024, 1:25:40 PM
to beast...@googlegroups.com
This is an interesting discussion.

I haven’t used BEAST for a while now, but I still read these posts occasionally.

As a virologist, I’ve sometimes wondered what would happen if I just threw various RNA viruses (enteroviruses, flaviviruses, bunyaviruses, rhabdoviruses, orthomyxoviruses, paramyxoviruses, etc.) into a massive alignment and ran BEAST (or BEAST 2) on it, to try to force convergence and see what I find, though any MRCA, etc. might be quite meaningless…

Julian 



David Ortiz

Mar 24, 2024, 9:32:54 PM
to beast-users
Hi again:

Thank you all for your suggestions and contributions. As a follow-up, I just wanted to let you know that the strategy suggested by Alexei (using the BEAST2 ORC package and removing unnecessary topological constraints) worked like a charm. I achieved good mixing, convergence, and good ESS values already after 25 million generations. Several replicates reached the same solution, with divergence times almost exactly matching those obtained with MCMCTree in PAML.
I also rolled back to a GTR+I+G model with the default 4 gamma categories, which ran considerably faster than the HKY+I+G model with 7 gamma categories.

DAVID