RAxML vs RAxML-ng on large alignement

121 views
Skip to first unread message

Gaetan

unread,
Nov 3, 2021, 3:37:44 AM11/3/21
to raxml
Hello,
I'm on a project where we generate a phylogenetic tree from a binary alignement we created. We have 6000 sequences and 250k characteristics per sequence. We are using RAxML, it's slow but it works great.

We increased the data to 16k sequences and 600k characteristics per sequence, but RAxML can't finish (a strange error in likehood computing).

Here is the command we use "raxmlHPC-PTHREADS-AVX2 -d -T 6 -p 12345 -m BINCAT -n sortie4 -o ERR2512533 -s alignment.fasta"

We wanted to try RAxML-ng with this command
"raxml-ng --msa T1.raxml.rba --model BIN+G --threads 10 --search1 --seed 12345 --outgroup ERR2512533 --prefix T2"

It's been two days, we are waiting to see how long it will take.

We are not experts in phylogenetric reconstruction, so we wanted to have your advice on our problem.
- Do you think our usage of raxml-ng is correct ?
- We are currently running the process on a single machine but we have access to a large datacenter, but from my understanding it's not a good idea to increase the threads too much (raxml suggested 12 and we used 10, because I was afraid of being out of memory, we have only 128GB on this machine)
- We generate the binary sequences ourself, so we can change the data if needed. We thought about removing characteristics present in only one sequence, but we are unsure of the impact.

Thank you so much

Grimm

unread,
Nov 3, 2021, 8:28:31 AM11/3/21
to raxml
Hi Gaetan,

Alexey and Alexis can probably tell you how long this will run (my guess is, even with RAxML-ng, it will need a couple of more days with 10 cores and 128 GB).

But aside the matrix dimensions, it's often the signal structure that slows down the processing of such large data sets. There are two main obstacles while roaming the tree-space (see also this paper on SARS-CoV-2 genomes, which represent an example of a particularly poor data set for inferring ML trees: https://academic.oup.com/mbe/article/38/5/1777/6030946):

Obstacle 1: Flat subtrees. If you have a lot of tips that are near-identical/very similar, the optimisation wastes a lot of computation time to put them into an order, even though each topological alternative is more or less equally probable. If you have already an idea about the basic structure of the tree to expect, run a backbone analysis using as few as possible placeholders and then use EPA to place the remainder of the 12k tips. If you don't have an idea, just use a clustering approach to define clusters of high-sequence similarity and randomly sample one or two tips from each cluster.

Obstacle 2: Terminal noise. With 600k characters you may have a lot of binaries that are randomly distributed rather than phylogenetically sorted. E.g. when your data includes saturated SNPs. You can easily view if this is a problem by generating a pairwise distance heat map or neighbour-net/neighbour-joining tree. If the tip branches/edges are very long compared to the internal branches, you may want to filter further.

Have patience and good inference, Guido

Grimm

unread,
Nov 3, 2021, 10:51:17 AM11/3/21
to raxml
PS Re: "We thought about removing characteristics present in only one sequence, but we are unsure of the impact." – This won't speed up anything; character-based inferences are not bothered by singletons (a tree is a summary of compatible taxon bipartitions, and a singleton defines a trivial, per se compatible with any alternative one). It will naturally reduce the length of terminal branches but should have no effect on the topoloy. For nucleotide and aminoacid data it may have a distorting impact on model optimisation but in case of binary data this should not be an issue.

Alexey Kozlov

unread,
Nov 3, 2021, 12:12:00 PM11/3/21
to ra...@googlegroups.com
Hi Gaetan,

you should consider the (always much appreciated!) suggestions by Guido and try to reduce you
dataset. Here are a couple of hints from my side:

> We are not experts in phylogenetric reconstruction, so we wanted to have your advice on our problem.
> - Do you think our usage of raxml-ng is correct ?

I'd try "--search --tree pars{1}" instead, it will probably converge (much) faster than starting
from a random tree.

> - We are currently running the process on a single machine but we have access wto a large datacenter,
> but from my understanding it's not a good idea to increase the threads too much (raxml suggested 12
> and we used 10, because I was afraid of being out of memory, we have only 128GB on this machine)

I'd try more threads, since memory consumption will most likely grow rather moderately for this
analysis..

> - We generate the binary sequences ourself, so we can change the data if needed. We thought about
> removing characteristics present in only one sequence, but we are unsure of the impact.

This will obviously reduce computation overhead a bit, but will also reduce parallelization
scalability, so depending on how many threads/cores you have in your system, you might not gain too
much.

In general, for any large dataset, I always recommend doing some exploration before submitting a
final run and waiting for weeks until it finishes. For instance, if a single SPR round takes many
days, or the number of rounds goes into hundreds (or even above 30-40), this is a clear sign that
you should have a second look at your dataset / analysis parameters / parallelization.

Hope this helps,
Alexey

Gaetan

unread,
Nov 4, 2021, 4:25:45 AM11/4/21
to raxml
Hello
Thank you so much for these advices. Our tree is Tuberculosis so we have a precise idea of what to expect, and we already have a tree with 6000 sequences, so I'll try to play with the starting tree.
We never tried EPA so we will take a look !

Thank you so much again

Grimm

unread,
Nov 4, 2021, 4:45:39 AM11/4/21
to raxml
EPA is always worth a try.

Important for EPA newbies: Pending the resolution in the ref tree (e.g. your 6000-tip tree will probably still include quite some flat terminal subtrees), you may get low LWR for the best hit.

Then it's worth checking the alternatives in the jPlace file

In principle,
  • low LWR at a deep/early diverged branch = query is a rogue/hard to place; the jPlace will give alternatives scattered in the root-proximal ("basal") part of the reference tree
  • low LWR in a terminal, rather flat subtree = query is just another variant of this subtree; the jPlace will only have alternative placements within the same subtree
This thread gives you an example how split LWR can look like (phylogeny-sorted spread-sheet version of the jPlace file):

To avoid having an inflated number of low LWR hits to cross-check, also here pruning the reference tree is the solution; to the minimum necessary tips needed to answer the relevant question: e.g. the hierarchical level one is interested in or to filter/group the data for subclade tip sets for downstream comprensive inferences.

Happy placing
/G
Reply all
Reply to author
Forward
0 new messages