RAxML & RAxML-NG on SNP dataset | Has the computation time increased?

George Pacheco

Oct 9, 2017, 5:57:52 AM
to raxml
Dear RAxML community,

I have been trying to run both RAxML and RAxML-NG on a SNP dataset (225 individuals / 24,876 SNPs) with the following commands (outputs follow):

raxmlHPC-PTHREADS-SSE3 -T 23 -f a -x 13111 -p 13552 -N autoMRE -m ASC_GTRCAT -V --asc-corr=lewis -s ./PBGP--GoodSamples_MinMaf-0.005_doHaploCall.ANGSD.fasta -n PBGP--GoodSamples_MinMaf-0.005_doHaploCall.ANGSD -W ./RAxML/

Alignment has 24509 distinct alignment patterns


Proportion of gaps and completely undetermined characters in this alignment: 0.25%


RAxML rapid bootstrapping and subsequent ML search


Using 1 distinct models/data partitions with joint branch length optimization


Executing 1000 rapid bootstrap inferences and thereafter a thorough ML search


All free model parameters will be estimated by RAxML

ML estimate of 25 per site rate categories


Likelihood of final tree will be evaluated and optimized under GAMMA


GAMMA Model parameters will be estimated up to an accuracy of 0.1000000000 Log Likelihood units


Partition: 0

Alignment Patterns: 24509

Name: No Name Provided

DataType: DNA

Substitution Matrix: GTR

Correcting likelihood for ascertainment bias


Time for BS model parameter optimization 75.363249

Bootstrap[0]: Time 11955.841476 seconds, bootstrap likelihood -2204291.142658, best rearrangement setting 5

Bootstrap[1]: Time 17819.758529 seconds, bootstrap likelihood -2202191.776994, best rearrangement setting 7

Bootstrap[2]: Time 17196.996258 seconds, bootstrap likelihood -2214420.067035, best rearrangement setting 7


raxml-ng --threads 46 --all --model GTR+G+ASC_LEWIS --bs-trees 1000 --msa ./PBGP--GoodSamples_MinMaf-0.005_doHaploCall.ANGSD.fasta --prefix ./RAxML/PBGP--GoodSamples_MinMaf-0.005_doHaploCall.ANGSD_RAxML-NG

Alignment comprises 1 partitions and 24509 patterns


Partition 0: noname

Model: GTR+FO+G4m+ASC_LEWIS

Alignment sites / patterns: 24876 / 24509

Gaps: 0.25 %

Invariant sites: 0.00 %

[00:00:00] Generating random starting tree(s) with 225 taxa

[00:00:00] Data distribution: partitions/thread: 1-1, patterns/thread: 532-533


Starting ML tree search with 20 distinct starting trees


[00:42:15] ML tree search #1, logLikelihood: -2035666.342883

[01:34:42] ML tree search #2, logLikelihood: -2035738.641678

[02:31:51] ML tree search #3, logLikelihood: -2035483.909907

[08:24:59] ML tree search #4, logLikelihood: -2035543.074018

[10:25:38] ML tree search #5, logLikelihood: -2035540.813373


As you can see, even though I am using a fair number of threads in both cases, these analyses are taking quite long, and I do not think my dataset is that large. May I kindly ask whether anyone has any suggestions? I used to run previous versions of RAxML with these parameters on bigger datasets, and it was much faster back then. So I reckon that at some point in the subsequent versions the computation time increased considerably.

Many thanks in advance, George.

Alexandros Stamatakis

Oct 9, 2017, 6:01:43 AM
to ra...@googlegroups.com
Dear George,

How many distinct site patterns does your alignment have?

The relevant figure is the number of site patterns, i.e., unique sites.
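
To illustrate what "site patterns" means here, the following is a minimal Python sketch that counts total and distinct alignment columns. The toy alignment and the function name are invented for illustration; the pattern counts reported by RAxML and RAxML-NG come from their own alignment parsers, which this sketch does not reproduce.

```python
def count_site_patterns(sequences):
    """Return (number of sites, number of distinct site patterns).

    A site is one alignment column; a pattern is a distinct column.
    """
    columns = list(zip(*sequences))  # transpose rows (taxa) into columns (sites)
    return len(columns), len(set(columns))


# Toy alignment (hypothetical): 3 taxa, 6 sites.
toy_alignment = [
    "AACGTA",
    "AACGTA",
    "AATGCA",
]

sites, patterns = count_site_patterns(toy_alignment)
print(sites, patterns)  # 6 sites, 4 distinct patterns
```

Identical columns are evaluated only once, which is why the number of patterns, rather than the raw number of sites, is the relevant figure for runtime.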

Alexis

--
Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology

www.exelixis-lab.org

George Pacheco

Oct 9, 2017, 6:07:24 AM
to raxml
Dear Alexis,

Thanks very much for your super quick reply.

RAxML says: Alignment Sites: 24,876 / Patterns: 24,509
RAxML-NG says: Alignment Sites: 24,876 / Patterns: 24,509

Is it too much? :(

Best, George.

Alexey Kozlov

Oct 9, 2017, 7:39:30 AM
to ra...@googlegroups.com
Hi George,

how many physical CPU cores do you have?

25K patterns are not much at all, and I'd rather use fewer threads (8 to 24).

Also please note that you are using different models with RAxML and RAxML-NG:

RAxML: no rate heterogeneity, empirical base freqs, rapid bootstrapping
RAxML-NG: GAMMA model of rate het., ML estimate of base freqs, slow bootstrapping

RAxML-NG currently doesn't support rapid bootstrapping; apart from that, you can get the same settings as in the (old) RAxML
command line by using "--model GTR+F+ASC_LEWIS".

Best,
Alexey

George Pacheco

Oct 9, 2017, 10:18:13 AM
to raxml
Dear Alexis, 

Thanks very much for your input. My comments are below:

> How many physical CPU cores do you have?

I have access to a maximum of 64 physical CPU cores.

> 25K patterns are not much at all, and I'd rather use fewer threads (8 to 24).

I tend to think that more threads is always better, but I see that this is not always the case. I will adjust the number of threads the next time I run this command.

> Also please note that you are using different models with RAxML and RAxML-NG:
>
> RAxML: no rate heterogeneity, empirical base freqs, rapid bootstrapping
> RAxML-NG: GAMMA model of rate het., ML estimate of base freqs., slow bootstrapping

Yes, I see that. I tried to run RAxML with the ASC_GTRGAMMA model without the -V option, but the computation time was also very long, so I decided to try the ASC_GTRCAT model with the -V option.

> Currently, RAxML-NG doesn't support rapid bootstrapping, otherwise you can get the same settings as in (old) RAxML
> command line by using "--model GTR+F+ASC_LEWIS".

Thanks, I will try the empirical base freqs. next time. However, I would prefer to have the ML estimates if possible.

However, my main concern was actually the running time. Maybe I did not make myself completely clear before, but, taking the RAxML-NG run as an example, right now I am using the GTR model with ML freqs. and applying the ASC_LEWIS correction. So, the program first generates 20 random starting trees (default) and subsequently performs 1,000 (my choice) slow bootstrap analyses on each of these trees. Using 46 threads, each of my ML tree searches is taking approx. 42 min. Would I be wrong to assume that each BS would also take approx. 42 min and that 1,000 BSs will be performed for each tree plus the best tree? If so, the total time for my job to finish would be roughly: 42 min * (20 * 1,001) = 584 days? Or are the 20 random trees only for the best ML tree? In this case, it would be: 42 min * (20 + 1,000) = 30 days?
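
As a quick check of the two estimates above, here is a minimal Python sketch of the arithmetic. It assumes, as in the question, a flat 42 minutes per ML tree search and per bootstrap replicate; the variable names are illustrative only and the figures are rough wall-clock estimates, not measurements.

```python
MINUTES_PER_DAY = 60 * 24

minutes_per_search = 42   # assumed flat cost of one tree search or one BS replicate
starting_trees = 20       # RAxML-NG default mentioned above
bootstrap_reps = 1000     # value passed via --bs-trees

# Scenario 1: 1,000 bootstraps (plus the search itself) for each of the 20 starting trees.
per_tree_total = minutes_per_search * starting_trees * (bootstrap_reps + 1)

# Scenario 2: 20 ML searches once, followed by 1,000 bootstrap replicates.
search_then_bs = minutes_per_search * (starting_trees + bootstrap_reps)

print(per_tree_total / MINUTES_PER_DAY)   # about 584 days
print(search_then_bs / MINUTES_PER_DAY)   # 29.75 days, i.e. about 30 days
```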

If that is the case, could I ask whether you have any recommendations for how to decrease this computation time?
  • Decreasing the number of BSs from 1,000 to 100 is probably a good start?
  • Maybe also decrease the number of random starting trees from 20 to, say, 10 or 5?
  • Work with empirical freqs. instead of ML freq. estimates? Does it affect the running time much?
  • Maybe use a less complex model than GTR? However, I am not sure whether any other model is compatible with the ASC_LEWIS correction.

Finally, how expensive is the ASC_LEWIS correction? Since I have SNP data (and so have to employ the ASC_LEWIS correction), would it be faster to work with the full dataset (SNPs + non-variable sites)? It is not clear to me whether doing so would increase or decrease the computation time, considering that the total number of sites would be much larger.

I do appreciate all your support!

Your best, George.

Alexey Kozlov

Oct 9, 2017, 12:05:35 PM
to ra...@googlegroups.com
Hi George,

> How many physical CPU cores do you have?
> I have access to a maximum of 64 physical CPU cores.

OK it's fine then.

> 25K patterns are not much at all, and I'd rather use fewer threads (8 to 24).
> I tend to think that more threads is always better, but I see that this is not always the case. I will
> adjust the number of threads the next time I run this command.

Unfortunately not. We usually recommend >1000 DNA patterns per thread; you can go below that threshold, but at some
point you will stop seeing runtime improvements from adding more threads.
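
The rule of thumb above can be applied to this dataset with a small Python sketch. The function names are made up for illustration, and the 1000-patterns-per-thread threshold is the guideline quoted above, not a hard limit enforced by either program.

```python
def patterns_per_thread(patterns, threads):
    """Average number of alignment patterns each thread has to process."""
    return patterns / threads


def max_recommended_threads(patterns, min_patterns_per_thread=1000):
    """Largest thread count that still keeps at least the given number of patterns per thread."""
    return max(1, patterns // min_patterns_per_thread)


patterns = 24509  # from the RAxML / RAxML-NG logs in this thread

print(round(patterns_per_thread(patterns, 46), 1))  # 532.8, matching "patterns/thread: 532-533" in the log
print(max_recommended_threads(patterns))            # 24
```

With 46 threads each thread gets only about half the recommended load, which is consistent with the earlier suggestion of 8 to 24 threads.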

> RAxML: no rate heterogeneity, empirical base freqs, rapid bootstrapping
> RAxML-NG: GAMMA model of rate het., ML estimate of base freqs., slow bootstrapping
> Yes, I see that. I tried to run RAxML with the ASC_GTRGAMMA model without the -V option, but the computation time was
> also very long, so I decided to try the ASC_GTRCAT model with the -V option.

OK I see.

> Currently, RAxML-NG doesn't support rapid bootstrapping, otherwise you can get the same settings as in (old) RAxML
> command line by using "--model GTR+F+ASC_LEWIS".
> Thanks, I will try the empirical base freqs. next time. However, I would prefer to have the ML
> estimates if possible.

That's all right, you can keep ML freqs. I just wanted to make sure you're aware of the difference (e.g. in case you want to
compare likelihood scores), since in RAxML-NG the default was changed from empirical to ML freqs.

> However, my main concern was actually the running time. Maybe I did not make myself completely clear before, but,
> taking the RAxML-NG run as an example, right now I am using the GTR model with ML freqs. and applying the ASC_LEWIS
> correction. So, the program first generates 20 random starting trees (default) and subsequently performs 1,000 (my
> choice) slow bootstrap analyses on each of these trees. Using 46 threads, each of my ML tree searches is taking
> approx. 42 min. Would I be wrong to assume that each BS would also take approx. 42 min and that 1,000 BSs will be
> performed for each tree plus the best tree? If so, the total time for my job to finish would be roughly:
> 42 min * (20 * 1,001) = 584 days? Or are the 20 random trees only for the best ML tree? In this case, it would be:
> 42 min * (20 + 1,000) = 30 days?

The latter calculation is correct (30 days). But BS replicate searches usually run faster, so this would be a
conservative/pessimistic estimate. Furthermore, you can parallelize across bootstraps to make better use of all 64 cores:

- with old RAxML, you can use raxmlHPC-HYBRID (please search this Google group for detailed explanations/recommendations)
- with RAxML-NG, you can start multiple instances with the "--bootstrap" command and different random seeds, then simply
concatenate all *.raxml.bootstraps files and draw branch support with the "--support" command
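
The RAxML-NG variant described above can be sketched in Python as follows. This only builds the command lines and does not run them. The alignment file name, the "bs" prefixes, the split into 4 instances of 16 threads, and the seed values are all assumptions made for illustration; the options themselves (--bootstrap, --msa, --model, --bs-trees, --seed, --threads, --prefix) are standard RAxML-NG options.

```python
def bootstrap_commands(msa, model, total_reps, instances, threads, first_seed, prefix="bs"):
    """Build one 'raxml-ng --bootstrap' command per instance, each with its own seed and prefix."""
    base, extra = divmod(total_reps, instances)
    commands = []
    for i in range(instances):
        reps = base + (1 if i < extra else 0)  # spread any remainder over the first instances
        commands.append([
            "raxml-ng", "--bootstrap",
            "--msa", msa,
            "--model", model,
            "--bs-trees", str(reps),
            "--seed", str(first_seed + i),      # distinct seed so replicates differ between instances
            "--threads", str(threads),
            "--prefix", f"{prefix}{i}",         # hypothetical prefix; output goes to <prefix>.raxml.bootstraps
        ])
    return commands


# Hypothetical layout: 4 instances x 16 threads = 64 cores, 250 replicates each.
cmds = bootstrap_commands("alignment.fasta", "GTR+G+ASC_LEWIS",
                          total_reps=1000, instances=4, threads=16, first_seed=1000)
for cmd in cmds:
    print(" ".join(cmd))
```

Once all instances have finished, the bs*.raxml.bootstraps files would be concatenated into a single file and passed, together with the best ML tree, to a "raxml-ng --support" run. With 16 threads per instance, each thread still handles roughly 1,500 of the 24,509 patterns.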

> If that is the case, could I ask whether you have any recommendations for how to decrease this computation time?
>
> * Decreasing the number of BSs from 1,000 to 100 is probably a good start?

That's one option. You can also use bootstopping to find the appropriate number of replicates. With old RAxML, you can
specify "-N autoMRE"; with RAxML-NG you can - for the time being - run bootstraps in batches of, say, 100, and then check
for convergence using the "-f B" option of the old RAxML.

> * Maybe also decrease the number of random starting trees from 20 to, say, 10 or 5?

I would not recommend this, since most of the runtime will be spent on bootstrapping anyway.

> * Work with empirical freqs. instead of ML freqs. estimates? Does it affect running time much?

Usually it doesn't, so please use whatever you find more appropriate.

> * Maybe use a less complex model than GTR? However, I am not sure whether any other model is compatible with
> the ASC_LEWIS correction.

In RAxML-NG, you can use any DNA matrix with any correction model. However, GTR is a good default choice, so I'd stick
to it.

> Finally, how expensive is the ASC_LEWIS correction? Since I have SNP data (and so have to employ the
> ASC_LEWIS correction), would it be faster to work with the full dataset (SNPs + non-variable sites)? It is not
> clear to me whether doing so would increase or decrease the computation time, considering that the total number of
> sites would be much larger.

Lewis (or any other asc. correction) is rather cheap. The full dataset will surely contain more sites, but most of them will
be compressed into just 4 patterns (depending on how much missing data you have). In any case, I'd expect the full dataset
to run slower.
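
The compression argument can be illustrated with a toy Python sketch. The sequences below are invented, and the code only mimics the idea of collapsing identical columns into weighted patterns; it is not how RAxML or RAxML-NG implement it.

```python
from collections import Counter


def compress(sequences):
    """Map each distinct alignment column (site pattern) to the number of sites showing it."""
    return Counter(zip(*sequences))


# Toy data, invented for illustration: 3 taxa, 3 variable sites (the "SNPs").
snps = ["ACT", "ATT", "GCA"]

# 40 invariant sites: ten columns each of A, C, G and T, identical in every taxon.
invariant = "A" * 10 + "C" * 10 + "G" * 10 + "T" * 10
full = [s + invariant for s in snps]

snp_patterns = compress(snps)
full_patterns = compress(full)
print(len(snps[0]), len(snp_patterns))    # 3 sites, 3 patterns
print(len(full[0]), len(full_patterns))   # 43 sites, 7 patterns

# One missing character inside an invariant column creates an extra pattern.
gappy = [full[0][:3] + "N" + full[0][4:], full[1], full[2]]
print(len(compress(gappy)))               # 8 patterns
```

The 40 invariant sites add only 4 patterns to the 3 SNP patterns, but every missing character in an otherwise invariant column can add a further pattern, which is why the amount of missing data matters.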

Finally, could you please send the full RAxML-NG log file and - if possible - the alignment file to my e-mail address?
Just in case, I will check whether anything is going wrong on this dataset.

Best,
Alexey

George Pacheco

Oct 10, 2017, 3:20:33 PM
to raxml
Dear Alexey,

Wow, these comments are immensely helpful!!! I will indeed follow all your suggestions!

Just a final question though: how can I control the random seed when using the --bootstrap command in RAxML-NG? Better said, do I need to control these seeds, or will every individual instance automatically get a random seed?

A petabyte of thanks, George.

P.S. - I sent the files you requested through individual reply. Please, let me know should you like them sent by any other way.

Alexey Kozlov

Oct 10, 2017, 4:12:26 PM
to ra...@googlegroups.com
Dear George,

> Wow, these comments are immensely helpful!!! I will indeed follow all your suggestions!

:)

> Just a final question though: how can I control the random seed when using the --bootstrap command in RAxML-NG?
> Better said, do I need to control these seeds, or will every individual instance automatically get a random seed?

That's true, each instance will get a different random seed by default (initialized from system clock), but for easier
reproducibility you can also set it manually with the "--seed" option.

> P.S. - I sent the files you requested through individual reply. Please, let me know should you like them sent by any
> other way.

I got your files, will check them shortly.

George Pacheco

Oct 10, 2017, 5:02:47 PM
to raxml
Perfect then - thanks, Alexey!