How to run ML search and BS 100 replicates most rapidly for a 30000 taxa * 30000 bp DNA dataset


Yunshi Liao

May 30, 2020, 10:37:09 AM
to raxml
Hi everyone,

I am running RAxML v8.2.4 on a large DNA dataset (30,000 taxa x 30,000 bp, ~28,000 patterns) on a Linux machine with 1 TB RAM and 40 threads. I want to search for the best-scoring ML tree and run 100 bootstrap replicates.

I am unsure which RAxML version and which tree search method I should use, and I have the following questions.

1. Since I am using a single computer rather than a cluster, is the PThreads version the most suitable one? I have seen people say that MPI can be used on a single computer by treating different threads as different nodes. Is that useful and faster in my case?

2. From my reading of the manual, the MPI version seems to be meant for running multiple inferences or rapid/standard bootstrap (BS) searches in parallel. Can raxmlHPC-MPI run the rapid bootstrap analysis and the search for the best-scoring ML tree in one go (-f a -x)?

3. Given the answers to the questions above, which approach should be faster: the rapid bootstrap analysis plus best-scoring ML tree search in one run (-f a -x), or a standard tree search followed by bootstrapping (-f d, then -f d -b)?

4. Should I use another program for such a dataset, for example ExaML or RAxML-NG?


Thanks for your help!
Yunshi

Alexey Kozlov

May 30, 2020, 1:00:08 PM
to ra...@googlegroups.com
Hi Yunshi,

I would recommend using RAxML-NG. Try the "--search1" command first to get an idea of how
long a single tree search will take (given the dataset size, it will take many hours). You can
find the RAxML-NG tutorial here:

https://github.com/amkozlov/raxml-ng/wiki/Tutorial
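
A minimal sketch of how such a quick timing run could be assembled from a script. The option names (--search1, --msa, --model, --threads, --prefix, --seed) are the ones used in the RAxML-NG tutorial; the file name, the GTR+G model, the thread count and the prefix are placeholders assumed for illustration, so please check them against `raxml-ng -h` for your version.

```python
def search1_command(msa, model="GTR+G", threads=16, prefix="quick", seed=1):
    """Build the argument list for one quick RAxML-NG tree search."""
    return [
        "raxml-ng", "--search1",
        "--msa", msa,                # alignment file (placeholder name below)
        "--model", model,            # substitution model (assumed, not from the thread)
        "--threads", str(threads),   # ideally one thread per physical core
        "--prefix", prefix,          # prefix for all output files
        "--seed", str(seed),         # fixed seed so the run can be repeated
    ]

cmd = search1_command("alignment.fasta")
print(" ".join(cmd))
# To launch it for real: import subprocess; subprocess.run(cmd, check=True)
```

Timing this single search first gives a rough basis for estimating what 100 bootstrap replicates would cost on the same machine.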

Also please make sure you remove duplicate sequences from your alignment (if any), since they slow
down tree inference without contributing any additional information.
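
A minimal sketch of the deduplication step, assuming a FASTA alignment and treating only exact (case-insensitive) matches as duplicates. The function names are hypothetical. Sequences that differ only in gaps or ambiguity codes are kept as distinct here.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) pairs."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records

def drop_duplicates(records):
    """Keep the first record per distinct sequence; remember what was dropped."""
    kept, first_seen, dropped = [], {}, {}
    for name, seq in records:
        key = seq.upper()
        if key in first_seen:
            # Record the duplicate under the name of its retained representative.
            dropped.setdefault(first_seen[key], []).append(name)
        else:
            first_seen[key] = name
            kept.append((name, seq))
    return kept, dropped

example = ">a\nACGT\n>b\nACGT\n>c\nACGA\n"
kept, dropped = drop_duplicates(read_fasta(example))
print([name for name, _ in kept])  # ['a', 'c']
print(dropped)                     # {'a': ['b']}
```

Keeping the `dropped` mapping makes it possible to add the removed tips back next to their representative after the tree has been inferred.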

Also, in case you are working with SARS-CoV-2 sequences, please make sure you perform proper
alignment filtering. Otherwise, chances are that most of the 28,000 patterns contain noise rather
than signal (sequencing errors, artifacts, etc.).

Best,
Alexey

Sam

May 31, 2020, 3:56:04 AM
to raxml
Dear Alexey,

I guess many of us are facing similar problems. The last few posts in the group, including mine, describe a related problem.
Almost everyone has a genome alignment of >12,000 strains with around 30,000 columns, although 90% of these columns are identical.

It would be really helpful if you could provide the commands to use for such a dataset.

Thanks

Alexandros Stamatakis

May 31, 2020, 10:55:37 PM
to ra...@googlegroups.com
The analysis of this type of dataset is very challenging and requires
special care. The 90% identical columns are not a problem: all these
identical sites will be compressed into site patterns, so they do not
really slow down the analysis, because the alignment on which the
searches are conducted is effectively much shorter. Nonetheless, they
should be kept in the alignment, as the high proportion of identical
site patterns affects the inferred branch lengths.
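
The compression into site patterns can be sketched as follows. This is an illustrative toy, not RAxML's actual implementation: identical alignment columns are collapsed into one pattern with a weight, so the likelihood of each pattern needs to be computed only once and is then weighted by how often the pattern occurs.

```python
from collections import Counter

def compress_patterns(seqs):
    """Collapse identical alignment columns into {pattern: weight}."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    # One string per column, read top to bottom across the taxa.
    columns = ("".join(s[i] for s in seqs) for i in range(length))
    return Counter(columns)

alignment = ["AACGTA",
             "AACGTA",
             "AATGTC"]
patterns = compress_patterns(alignment)
print(len(alignment[0]), "sites ->", len(patterns), "patterns")  # 6 sites -> 5 patterns
print(dict(patterns))
```

The weights are the reason the identical columns should stay in the alignment: they cost almost nothing to evaluate, but they still enter the likelihood and hence the branch-length estimates.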

However, the dataset has insufficient phylogenetic signal. It therefore
cannot and should not be analyzed with some standard command line that
we provide here. It requires a more involved analysis that carefully
explores whether there is sufficient signal to even represent the result
as a binary/bifurcating tree, which I personally seriously doubt.

Alexis

--
Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology

www.exelixis-lab.org

Grimm

Jun 1, 2020, 12:14:48 PM
to raxml
Hej,

Here's a possible experimental layout for dealing with virus data (copied and pasted from my next post; note what Alexis already stated: in fact, inferring an ML tree from CoV-2 data or similar is generally not a good idea).
  Step 1: Remove duplicates. Duplicates increase the computational load while providing no additional information for the tree optimisation: the flatter the likelihood surface of the tree space, the longer the bootstrapping and the final optimisation take.
  Step 2: Remove satellite genotypes, for the same reason. One may even think about removing intermediate types too, by filtering out all tips that differ by less than a fixed threshold number of mutations.
  Step 3: Test for and prune rogues. Try RogueNaRok (GitHub, Blackbox).
  Step 4: Infer a backbone tree for the pruned-down set, as a framework for the next steps.
  Step 5: Explore topological ambiguity. Check the BS consensus networks: are there alternatives with similar support (split BS support) or is there no discriminating signal (all alternatives with very low BS support)?
  Step 6: Optimise the position of the filtered rogues using RAxML's evolutionary placement algorithm. Recombinants may have more than one optimal position; ancestors will have more than one.
  Step 7: Hack-and-slash. Optimise poorly resolved subtrees using population genetic methods such as minimum-spanning, median-joining and statistical parsimony. Check the subalignments visually.
Parts of this protocol (especially Step 1, Step 3 and a modified Step 7) may also help to analyse extremely taxon-rich matrices (such as 30,000 x 30,000), especially when you have no access to a CPU cluster. For most evolutionary questions there are very few reasons to optimise a 30,000-tip phylogeny. One can establish genetically coherent groups (low ingroup divergence, substantial intergroup difference) using quick, distance-based approaches (one neat programme for this is optsil, when you have a taxonomic scheme at hand), then infer a backbone phylogeny using placeholders for each group and fill it in with trees inferred using only the members of one group.

More often than not, super-matrices can benefit from super-network/consensus network approaches.

Grimm

Jun 1, 2020, 12:49:35 PM
to raxml
Hi Yunshi,

 
> I am running RaxML v8.2.4 for a big DNA dataset (30000 taxa * 30000 bp, ~28000 patterns),

Given the matrix dimensions and the number of distinct alignment patterns (DAP), it seems you are aiming for a comprehensive, all-inclusive coronavirus tree rather than only a SARS-CoV-2 tree (if this is SARS-CoV-2 data, then it is in very bad shape: too many patterns, see Alexey's comment). Note that analysing such a dataset using ML tree inference (or any other tree inference) is probably impossible, because already within the SARS group (including the original SARS and SARS-CoV-2) we seem to have issues with larger-scale recombination, which means that parts of the virus genomes have different evolutionary histories.

[Attached image: NNetCPlusRecomb.png]


Recombination will not be obvious from any inferred tree, at best, it will manifest in split BS support for strongly different topological alternatives.

Already in the case of the SARS family (i.e. probably much worse for all Betacoronavirus or all coronaviruses), the tree-unlike signal due to recombination comes on top of sequence portions that must be homologous based on their position in the virus genomes but otherwise have nothing in common and can be up to 100% different (i.e. you have to deal with extreme forms of long-branch attraction, LBA, which even ML cannot handle).


[Attached image: CloseUpsLostComplexity.png]


You will always get a tree, but whether it is a meaningful estimate of the phylogeny is a very different question (the pictures are from the related March post on the Genealogical World of Phylogenetic Networks; the alignment we used is freely available on figshare).

Good inference, Guido

Grimm

Jun 2, 2020, 10:10:53 AM
to raxml
PS: A related post triggered by the questions in this thread, "Inferring a ML tree with 12000 (or more) virus genomes", with a bit of simple, introductory artwork, further links and a few more tips, is now online at my Res.I.P. blog (hosted via Blogger for reasons of convenience; utterly pro bono from my side, but being a Google service it is not globally accessible).

Cheers, G.


Yunshi Liao

Jun 2, 2020, 1:13:50 PM
to raxml
Dear Alexey and Alexis,

Thank you very much for your help!

As suggested by Alexey, I just ran a quick-and-dirty best-tree search using the alignment with duplicates removed. I will come back for more comments from you later on.

As for the SARS-CoV-2 dataset, I totally agree that the high similarity among the sequences may not provide enough signal to build a "good" binary/bifurcating tree.

Regarding site patterns, may I ask whether the definition of "site pattern" in RAxML is the same as in other programs, e.g. PAML, where it is used to compress sites that share the same bases in the same arrangement across taxa?

I agree that the large number of site patterns here is odd, and I will double-check the alignment carefully to get rid of misalignments. Please feel free to suggest any useful tools for examining alignment quality and finding errors.

I am also a bit confused about the definition of "invariant site" here. Based on the screenshot below, more than 50% of the sites are invariant, so I would think that those ~14,000 sites should be compressed into 4 patterns (all A, all T, all C or all G). The total pattern count should then be at most 15,000, which is not consistent with the reported number of patterns. Is my understanding of these two terms wrong?

[Attached image: IMG_8956(20200602-212542).PNG]


And a stupid question: how many rounds of SPR are normally conducted? This process does not seem to be reported in the stdout of the previous standard RAxML.

Lastly, I found that in RAxML-NG both pthreads and MPI apply fine-grained parallelization. For a single-node local computer, I would therefore think that pthreads with a suitable number of threads/cores is more efficient than MPI?

Thanks.


Alexandros Stamatakis

Jun 2, 2020, 11:48:28 PM
to ra...@googlegroups.com
Dear Yunshi,

> Really thanks for your help!

:-)

> As suggested by Alexey, I just ran a quick-and-dirty best tree search
> using duplicate removed alignment. Would try to get more comment from
> you later on.
>
> And for the SARS-COV2 dataset, I totally agree that the high similarity
> among sequences may not provide enough signal to build a "good"
> binary/bifurcating tree.
>
> In term of site pattern, may I ask the definition of "site pattern" in
> RaxML is the same as others, e.g., PAML, which is used to compress the
> sites sharing the same base composition and arrangement along taxa?

Yes, this technique is used in all ML and Bayesian inference programs.

> I agree that the large size of site patterns here is weird

It is not, since there are very few mutations in the data, it's more
like a pop. gen. dataset than a phylogenetic dataset.

> and I would
> double check carefully on the alignment to get rid of misalignment.
> Please feel free to suggest me any useful tools to examine alignment
> quality and find out errors.

That is not really our area of expertise.

> While I am a bit confused about the definition of "invariant site" here.
> Based on the screenshot below, more than 50% sites are invariant, then I
> would think that those ~14000 sites should be compressed into 4 patterns
> (all A or T or C or G).

Yes.

> Then the total pattern count should be at most
> 15000, which is not consistent with the current pattern number. May I
> know if my understanding of these 2 items here is wrong?

There are invariant sites, but then, there are also identical sites that
are not invariant.
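
One way a column can count as invariant without collapsing into one of the four all-A/C/G/T patterns is through gaps or ambiguity codes. The sketch below is a toy illustration under an assumption that is not confirmed in this thread: a column is called invariant when all of its non-missing characters agree, while pattern compression still compares the full column, missing characters included.

```python
MISSING = set("-N?")  # assumption: these characters are treated as missing data

def column_stats(seqs):
    """Return (sites, invariant sites, distinct patterns) for an alignment."""
    columns = ["".join(col).upper() for col in zip(*seqs)]
    # Invariant here means: at most one distinct non-missing character.
    invariant = sum(1 for col in columns if len(set(col) - MISSING) <= 1)
    return len(columns), invariant, len(set(columns))

alignment = ["AAAAC",
             "A-AAC",
             "AA-AT",
             "AAA-T"]
sites, invariant, patterns = column_stats(alignment)
print(sites, invariant, patterns)  # 5 4 5
```

In this toy alignment four of the five columns are invariant, yet every column is its own pattern because the gaps sit in different taxa. Under the stated assumption, a high share of invariant sites therefore does not put a tight upper bound on the pattern count.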

>
> IMG_8956(20200602-212542).PNG
>
>
> And a stupid question, how many rounds of SPR would normally conduct? As
> such process seems not included in stdout of previous standard RaxML.

This is very dataset-dependent, i.e., one can not predict it. However,
on a dataset with weak signal you would expect many more SPR rounds than
on a dataset with strong signal.

> Lastly, I found that in RAxML-NG, both pthreads and MPI
> apply fine-grained parallelization; then for a single-node local
> computer, I think using pthreads with suitable amount of threads/cores
> would be more efficient than MPI?

Not necessarily, as the efficiency of the parallelization depends on the
number of distinct site patterns, the rule of thumb is to run 1000 site
patterns on one core, so you may be better off running several
independent tree searches on your machine.
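
The rule of thumb can be turned into a small planning helper. This is only a sketch: the function name and the way cores are split are assumptions made for illustration, and it ignores the memory that each concurrent search needs.

```python
def plan_parallel_runs(site_patterns, physical_cores, patterns_per_core=1000):
    """Split cores into independent searches using ~1000 patterns per core."""
    # Threads beyond site_patterns / patterns_per_core are unlikely to pay off.
    useful = site_patterns // patterns_per_core
    threads_per_search = max(1, min(physical_cores, useful))
    concurrent_searches = physical_cores // threads_per_search
    return threads_per_search, concurrent_searches

print(plan_parallel_runs(28000, 16))  # (16, 1)
print(plan_parallel_runs(5000, 16))   # (5, 3)
```

With ~28,000 patterns a 16-core machine is fully used by one search, while an alignment filtered down to ~5,000 patterns would be better served by three concurrent searches with five threads each.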

Alexis

Grimm

Jun 3, 2020, 7:50:21 AM
to raxml
Hej,

> I agree that the large size of site patterns here is weird and I would double check carefully on the alignment to get rid of misalignment. Please feel free to suggest me any useful tools to examine alignment quality and find out errors.

If these are all CoV-2 genomes, misalignment can hardly be the issue; instead, the pattern count reflects a high number of random, stochastically distributed mutations in the data set. They may be genuine (recombinants, convergence and back-mutations will increase the number of distinct alignment patterns) or sequencing artefacts. Sequencing artefacts are obvious in a few (non-CoV-2) sequences stored in the gene banks (repeatedly a single missing or extra nucleotide in one sequence out of hundreds; they are difficult to spot in the case of SNPs), and they will be a problem for uncurated GISAID data as well (or even more so). While a handful of poor base calls (10 out of more than 15,000 variable sites potentially providing a phylogenetic signal) don't matter for higher-level phylogenies, they are a pain when you work at the coalface of evolution, as in the case of SARS-CoV-2, where the main clades are defined by at most 5 shared mutations in the entire genome (see the GISAID phylogeny).

In either case, genuine or artefact, those additional patterns are utterly useless for optimisation, and you need to get them out of the matrix before you run any tree inference, e.g. by eliminating all sites that differ in only a small fraction of the accessions.
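
A minimal sketch of such a site filter, assuming a simple count threshold: a variable column is dropped when all of its non-majority bases together occur in fewer than `min_minor_count` sequences. The function name, the threshold and the set of characters treated as missing are hypothetical choices for illustration. Invariant columns are deliberately kept.

```python
from collections import Counter

def filter_rare_variant_sites(seqs, min_minor_count=2, missing="-N?"):
    """Drop variable columns supported by fewer than min_minor_count sequences."""
    keep = []
    for i, col in enumerate(zip(*seqs)):
        counts = Counter(c.upper() for c in col if c.upper() not in missing)
        if len(counts) <= 1:
            keep.append(i)  # invariant column: always kept
            continue
        # Number of sequences that do not carry the majority base.
        minor = sum(counts.values()) - max(counts.values())
        if minor >= min_minor_count:
            keep.append(i)
    filtered = ["".join(s[i] for i in keep) for s in seqs]
    return filtered, keep

seqs = ["ACGTA",
        "ACGTA",
        "ATGCA",
        "ACGCA"]
filtered, kept_columns = filter_rare_variant_sites(seqs)
print(kept_columns)  # [0, 2, 3, 4]
print(filtered)      # ['AGTA', 'AGTA', 'AGCA', 'AGCA']
```

Column 1 is removed because its variant occurs in a single sequence, while column 3 survives because two sequences share the minority base. Returning the kept column indices makes it easy to compare several thresholds on the same alignment.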

For comparison, the alignment for the guide tree in our posts (only 400 genomes, but covering all SARS-type coronaviruses), filtered for highly divergent, non-alignable regions, ended up with only 6902 DAPs for 27741 sites. The maximum genetic distance in our data was 0.2, compared to 0.0006 for the set of early sequenced CoV-2 genomes (this has probably doubled by now).

Happy filtering, Guido

Yunshi Liao

Jun 3, 2020, 10:47:36 AM
to raxml
Dear Alexis,

Thanks for your explanation! So it seems that "invariant sites" is more or less a free parameter estimated by RAxML rather than a statistic derived directly from the alignment? And it seems that "invariant sites" cannot be used as a direct hint to deduce the number of site patterns?

After a few days of running, my run seems to have run into some problems. Could you please have a look?

[Attached image: IMG_8970(20200603-223954).JPG]


Thanks!




Yunshi Liao

Jun 3, 2020, 10:58:17 AM
to raxml
Dear Guido,

> In either case, genuine or artefacts, those additional patterns are utterly useless for optimisation and you need to get them out of the matrix before you run any tree inference. E.g. by eliminating all sites that differ in just a fraction of the accessions.

So what you suggest is to remove the sites that contain only a small number of minor variants? Because those "variants" may actually be artefacts that are useless for tree optimisation but add a large computational burden? And these "variants" may be what inflates the pattern count in my data?

Thanks.


Grimm

Jun 3, 2020, 11:38:06 AM
to raxml


On Wednesday, June 3, 2020 at 16:58:17 UTC+2, Yunshi Liao wrote:
> Dear Guido,
>
> In either case, genuine or artefacts, those additional patterns are utterly useless for optimisation and you need to get them out of the matrix before you run any tree inference. E.g. by eliminating all sites that differ in just a fraction of the accessions.
>
> So here you want to suggest is to remove those sites with only a little amount of minor variants within? As those "variants" may actually be artefacts and it is useless for tree optimisation but lead to great additional computational pressure? And this "variants" may lead to oversized pattern count as my data here.


Exactly. Here's a picture depicting the extreme (worst) case: a set of random transitions from C to U conflicting with an AG vs. GA pattern separating the two principal lineages. It's virtually impossible for tree inference to solve this problem speedily and comprehensively.

[Attached image: InflatedDAP.png]



With CoV-2 data (and the like) you have a multitude of such situations, plus cases where the incompatibility may result from actual recombination (I already posted the following in another thread, but it can't hurt to repeat it here; these are actual CoV-2 mutation patterns that can be observed in the early samples, given the fine incubation and recombination potential on the Diamond Princess).

[Attached image: MutationPatterns2.png]

[Attached image: MutationPatterns1.png]

(These two pics are from this post: Using Median Networks to study SARS-CoV-2, a comment to Forster et al., PNAS, 2020)

There is of course no rule for the frequency at which to drop an alignment pattern. E.g. if you only have two samples from a remote place but of different affinity, their shared unique mutations (especially if consecutive) may be genuine and direct evidence for local recombination of two strains infecting a confined population (like on the cruise ships). This probably requires some experimenting with different thresholds.

And one should not forget that for each site I drop, I increasingly misinform RAxML regarding the amount of invariance, which is a parameter RAxML-NG makes use of during optimisation (check out also the many threads on "ascertainment bias correction")


Alexandros Stamatakis

Jun 3, 2020, 11:50:14 PM
to ra...@googlegroups.com
Dear Yunshi,

> Thanks for your description! So it seems "invariant sites" is more or
> less a free parameter estimated by RaxML rather than a statistic
> directly derived from alignment?

Well it's both actually, there are the invariant sites in the MSA and
then there is the proportion of invariant sites estimate.

> And it seems "invariant sites" cannot
> be used as a direct hint to deduce "site pattern"?

Correct.

>
> Then after a few days running, my run seems meet some issues. Could you
> help me to have a look?
>
> IMG_8970(20200603-223954).JPG

It might be that you don't have enough RAM.

Alexis

Yunshi Liao

Jun 4, 2020, 3:03:44 AM
to raxml
Dear Alexis,

I am using a computer with 650 GB of RAM available, but maybe that is still not enough, given the large number of taxa and patterns here.

Do you think it may also be related to the oversubscription of threads? I know that RAxML-NG performs best with one thread per physical core, and that core oversubscription is counter-productive and can lead to a major (>10x) slowdown. But I mistakenly used 20 threads for this run, while my PC has 2 CPUs with 8 cores each, so I guess I should not use more than 16 threads...

I will definitely limit the number of threads in my next run, but I am curious whether too many threads could even lead to a RAM shortage?

I will also try to remove the "fake" noise in the alignment to reduce the "artificial" patterns. Do you think it also makes sense to remove the sites where all bases are identical, which I think are not phylogenetically informative?

Thanks.


Grimm

Jun 4, 2020, 4:21:34 AM
to raxml
Hej,


 
> I will also try to remove those "fake" noise in the alignment to reduce "artificial" patterns, but do you think it is meaningful to remove those sites with all identical base which I think it is not phylogenetic-informative.


Removing invariable sites will have little effect on the speed, but if you do remove them, you need to correct for ascertainment bias. If you don't, the optimised branch lengths can be severely off. Check out this result for a simple, artificial binary matrix entering the Felsenstein Zone and falling prey to long-branch attraction (LBA): C is sister of A and B is sister of D, but because the fossils are much more primitive than their extant sisters, they cannot be properly resolved. This is a situation you can also run into with virus data if there are white zones in the sample.
Legend: "ML, unc." is ML not corrected for ascertainment bias; "asc." is corrected. Note the different scales!

[Attached image: FelsensteinAllInfTrees.png]


So, removing invariable sites will probably cause more problems than it solves.

Alexandros Stamatakis

Jun 4, 2020, 7:39:12 AM
to ra...@googlegroups.com
Dear Yunshi,

> I am using a computer with 650GB RAM available but maybe still not that
> enough especially the big taxa size and pattern number here.

In that case, RAM is surely not an issue.

> Do you think it may also be related to the oversubscription of threads I
> used. I know RAxML-NG shows best performance with one thread per
> physical core. And core oversubscription is counter-productive and can
> lead to a major (>10x) slowdown. But I mistakenly used 20 threads for
> this run while my PC possesses 2 CPUs each with 8-core, so I guess I
> should not use more than 16 threads...

Exactly, the slowdown induced by oversubscription is extremely high in
general.
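
The arithmetic behind the thread limit can be written down as a small check. The helper name is hypothetical; the socket and core counts are the ones quoted above. Note that Python's `os.cpu_count()` reports logical CPUs, so on machines with hyperthreading it may overstate the number of physical cores.

```python
def check_threads(requested, sockets, cores_per_socket):
    """Cap a requested thread count at the number of physical cores."""
    physical = sockets * cores_per_socket
    oversubscribed = requested > physical
    return min(requested, physical), oversubscribed

threads, oversubscribed = check_threads(requested=20, sockets=2, cores_per_socket=8)
print(threads, oversubscribed)  # 16 True
```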

> I will definitely control the thread number I use next run, while I am
> just curious that if that oversized threads would even lead to RAM shortage?

Most likely not.

> I will also try to remove those "fake" noise in the alignment to reduce
> "artificial" patterns, but do you think it is meaningful to remove those
> sites with all identical base which I think it is not
> phylogenetic-informative.

See Guido's reply.

Alexis