starting ML optimization from random starting tree

379 views
Skip to first unread message

SoniaN

unread,
Sep 30, 2020, 6:17:45 PM9/30/20
to raxml
Hello. Does starting the RAxML search with a complete random starting tree instead of the default help a case where the tree space is quite flat to start with? My dataset is relatively small (~ 300 taxa, less than 2k character positions) and the resulting best ML tree using the “full” raxml in CIPRES with default parameters shows very low bootstrap values (mostly zeros) especially for the branches making the backbone of the tree. I assumed that meant I need more data to resolve the backbone, but a collaborator on the study says that using the raxml default in cipres was my mistake and doing  "multiple searches with different random OTU input orders” is what I should have done. Does that mean starting the searches with a complete random starting tree? I have also been told that the number of my tree searches can be an issue. I am not sure what number to pick. It seems I can go up to 1000 in raxml and I believe that the default is 10? In what cases would the default be enough? Or not enough? Any guidance you can offer me would be much appreciated.

Alexey Kozlov

unread,
Oct 1, 2020, 5:16:14 AM10/1/20
to ra...@googlegroups.com
Hi Sonia,

first of all, if you are still using the "old" RAxML8.x, I recommend upgrading to RAxML-NG
(https://github.com/amkozlov/raxml-ng), which is also available on CIPRES.

Second, it is always a good idea to use both completely random and parsimony-based starting trees
with RAxML(-NG). It is hard to tell what number of starting tress is "sufficient". But as a rule of
thumb, if you see a lot of variation in the log-likelihoods between the final trees (which is likely
on a dataset with your dimensions), then it is worth increasing the number of starting trees.

Finally, it is not surprising that you got low bootstrap support values on deep branches. This is
due to combination of low signal (2k characters) and the way Felsenstein's bootstrap works. There is
an alternative support metric called Transfer Bootstrap Expectation (TBE), which can alleviate this
problem:

https://www.nature.com/articles/s41586-018-0043-0

TBE is implemented in RAxML-NG and can be easily computed from existing bootstrap trees:

raxml-ng --support --tree bestML.tree --bs-trees bootstraps.tree --bs-metric TBE

Hope this helps,
Alexey
> --
> You received this message because you are subscribed to the Google Groups "raxml" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to
> raxml+un...@googlegroups.com <mailto:raxml+un...@googlegroups.com>.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/raxml/2d5e0d9e-70d7-4871-80d8-eea9f148718do%40googlegroups.com
> <https://groups.google.com/d/msgid/raxml/2d5e0d9e-70d7-4871-80d8-eea9f148718do%40googlegroups.com?utm_medium=email&utm_source=footer>.

Grimm

unread,
Oct 1, 2020, 10:06:23 AM10/1/20
to raxml
Hi Sonia,

if increasing the number of starting trees and applying the TBE values does not work, it's because any probabilistic method needs some discriminating tree-like signal in the matrix. You typically get a lot of branches with BS ~ 0 when your data is either:
  • not the product of a tree-like evolution, for instance, virus data including a lot of recombinants and ancestor-descendant pairs will collapse any branch support because each bit of the sequence follows a different tree, and much of the evolution in the data is not the product of dichotomy, the splitting of one into two leaving nothing behind.
  • population-type data with very few, phylogenetically sorted mutations.
I guess, it's the latter, since you said, your tree-space is very flat. Probabilistic methods struggle in such case because they need some information to model: any tree has the same probability under the optimised model. To see whether your data suffice to infer any tree, you can try a quick Bayesian analysis (e.g. with ExaBayes or any other Bayesian programme): when you also have PP << 1 for most branches, then there is no proper signal to infer a probabilistic tree in your data and your only option (and the much better one) are methods used to infer haplotype networks.

The number of distinct alignment patterns that RAxML calculates (DAP) may give you a clue, too. But for population data, you can have relatively high number of DAP, but since you lack the dichotomy-model sorting, you'll be lost utterly in tree-space.

Good roaming of the plains, Guido

SoniaN

unread,
Apr 22, 2021, 5:06:11 PM4/22/21
to raxml
Hello,

I am picking up on an old conversation I started last year on determining the number of starting trees. I am back to this because I ran out of my CPU-hours allocation for this year on CIPRES and am told that one likely reason is that the number of raxml searches I have been doing has been too extreme.

Per the above recommendations from you, back in October I switched to raxml-ng and followed the instructions above about increasing the number of my starting trees. I did some tests to figure out what a good number of starting trees could be. You can see the results of my tests in the attached document where I compare the log likelihoods of my runs as I increased the number of starting trees. Each sheet corresponds to a single locus of about 2K bp for about 400-500 taxa. You’ll need to scroll down to the end of each sheet to see the results and graph. Based on what I learned from this exercise, I decided to set up each of my runs with 500 parsimony starting trees and 500 random starting trees. However, when I ran out of time on CIPRES and asked whether there had been a miscalculation of my hours, I was told that 1) my number of searches in each run have been too extreme and 2) raxml-ng bootstrap runs are slow and take a lot of CPU hours. My question here is, given what you see in the attached document, how many starting trees would you have considered enough to generate my single gene trees?

To me, it seems that each run could be more efficient than it is, such that one could run fewer runs but make each run longer, although I have absolutely no idea if that is possible or how much time it would have saved me. At this point I am working on a series of combined datasets of these loci (plus some more loci added which are not on the spreadsheet). My combined matrices each include ~400 taxa and about 20K bp. I am requesting additional CPU hours to generate my combined trees but CIPRES advised me to consult with you first about whether this many searches are necessary to find the trees. Please advise!
KF_RF.xlsx

SoniaN

unread,
Apr 22, 2021, 5:36:06 PM4/22/21
to raxml
Also, apologies! I forgot to thank Guido for the very helpful explanations and suggestions in response to my original post! The gist of the story is that the PI on this project wants ML trees, so exploring Bayesian analyses or haplotype networks were not options I could explore for this data specifically but I will definitely keep them in mind for other similar situations in the future. Many thanks for your very clear and helpful explanations, Guido. I always learn a lot from your posts. 

On Thursday, October 1, 2020 at 7:06:23 AM UTC-7 Grimm wrote:

Alexandros Stamatakis

unread,
Apr 23, 2021, 12:02:08 AM4/23/21
to ra...@googlegroups.com
Hi Sonia,

> I am picking up on an old conversation I started last year on
> determining the number of starting trees. I am back to this because I
> ran out of my CPU-hours allocation for this year on CIPRES and am told
> that one likely reason is that the number of raxml searches I have been
> doing has been too extreme.

Most probably so.

> Per the above recommendations from you, back in October I switched to
> raxml-ng and followed the instructions above about increasing the number
> of my starting trees. I did some tests to figure out what a good number
> of starting trees could be. You can see the results of my tests in the
> attached document where I compare the log likelihoods of my runs as I
> increased the number of starting trees. Each sheet corresponds to a
> single locus of about 2K bp for about 400-500 taxa. You’ll need to
> scroll down to the end of each sheet to see the results and graph. Based
> on what I learned from this exercise, I decided to set up each of my
> runs with 500 parsimony starting trees and 500 random starting trees.
> However, when I ran out of time on CIPRES and asked whether there had
> been a miscalculation of my hours, I was told that 1) my number of
> searches in each run have been too extreme and 2) raxml-ng bootstrap
> runs are slow and take a lot of CPU hours. My question here is, given
> what you see in the attached document, how many starting trees would you
> have considered enough to generate my single gene trees?

To get an objective answer I would run the bootstrap convergence
criterion post hoc on the file containing all ML trees. It's actually
been designed for assessing how many BS replicates are necessary but it
can also be used to assess the topological stability of a ML tree set,
i.e., if adding more trees would change something.

So my suggestion would be to run this on your collection of ML trees to
obtain a somewhat objective answer (or at least a quantifiable one) if
you have done enough searches or not.

> To me, it seems that each run could be more efficient than it is, such
> that one could run fewer runs but make each run longer, although I have
> absolutely no idea if that is possible or how much time it would have
> saved me.

You can't really make the individual runs longer due to the design of
the search algorithm.

> At this point I am working on a series of combined datasets of
> these loci (plus some more loci added which are not on the spreadsheet).
> My combined matrices each include ~400 taxa and about 20K bp. I am
> requesting additional CPU hours to generate my combined trees but CIPRES
> advised me to consult with you first about whether this many searches
> are necessary to find the trees. Please advise!

See above, I hope this helps a bit.

Alexis

> On Thursday, October 1, 2020 at 7:06:23 AM UTC-7 Grimm wrote:
>
> Hi Sonia,
>
> if increasing the number of starting trees and applying the TBE
> values does not work, it's because any probabilistic method needs
> some discriminating tree-like signal in the matrix. You typically
> get a lot of branches with BS ~ 0 when your data is either:
>
> * not the product of a tree-like evolution, for instance, virus
> data including a lot of recombinants and ancestor-descendant
> pairs will collapse any branch support because each bit of the
> sequence follows a different tree, and much of the evolution in
> the data is not the product of dichotomy, the splitting of one
> into two leaving nothing behind.
> * population-type data with very few, phylogenetically sorted
> mutations.
>
> I guess, it's the latter, since you said, your tree-space is very
> flat. Probabilistic methods struggle in such case because they need
> some information to model: any tree has the same probability under
> the optimised model. To see whether your data suffice to infer any
> tree, you can try a quick Bayesian analysis (e.g. with ExaBayes or
> any other Bayesian programme): when you also have PP << 1 for most
> branches, then there is no proper signal to infer a probabilistic
> tree in your data and your only option (and the much better one) are
> methods used to infer haplotype networks.
>
> The number of distinct alignment patterns that RAxML calculates
> (DAP) may give you a clue, too. But for population data, you can
> have relatively high number of DAP, but since you lack the
> dichotomy-model sorting, you'll be lost utterly in tree-space.
>
> Good roaming of the plains, Guido
>
> alexei...@gmail.com schrieb am Donnerstag, 1. Oktober 2020 um
> 11:16:14 UTC+2:
>
> Hi Sonia,
>
> first of all, if you are still using the "old" RAxML8.x, I
> recommend upgrading to RAxML-NG
> (https://github.com/amkozlov/raxml-ng
> <https://github.com/amkozlov/raxml-ng>), which is also available
> <https://groups.google.com/d/msgid/raxml/2d5e0d9e-70d7-4871-80d8-eea9f148718do%40googlegroups.com?utm_medium=email&utm_source=footer
> <https://groups.google.com/d/msgid/raxml/2d5e0d9e-70d7-4871-80d8-eea9f148718do%40googlegroups.com?utm_medium=email&utm_source=footer>>.
>
>
> --
> You received this message because you are subscribed to the Google
> Groups "raxml" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to raxml+un...@googlegroups.com
> <mailto:raxml+un...@googlegroups.com>.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/raxml/82725b6b-5289-41c5-a5ca-d480e1891d90n%40googlegroups.com
> <https://groups.google.com/d/msgid/raxml/82725b6b-5289-41c5-a5ca-d480e1891d90n%40googlegroups.com?utm_medium=email&utm_source=footer>.

--
Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology

www.exelixis-lab.org

Grimm

unread,
Apr 23, 2021, 5:41:13 AM4/23/21
to raxml
Hi Sonia,

looking at the skylines in the spread sheet – which in themselves are very intriguing, as the mimick Bayesian skylines and proof Alexi's point on how important more than 10 starting trees are for particular datasets to get the best-possible topology, possibly speeding also the final optimisation; you should keep them for the supplement – I'll predict (boldly) that some of your gene trees samples have not yet met the convergence criterion Alexi mentioned; specifically trnLF, L73, L194, L373 and L382

However, since your objective is to get ML trees, I would report this (criterion met or not after 1000 starting trees > signal issue leading to inherent topological ambiguity) to illustrate the basic issue with your matrix but not bother to go higher, because it may be a lot to go: note, even your plateaus are slightly inclined.

A reason for the only slight inclination maybe that the first tree of the flattened curve and the last (which may be 1001th or 1 millionth) before reaching stationarity only differ by quite irrelevant branches given your data's signal but already agree in major aspects. But they also show while there is some coherent signal in your data to infer ML trees, it is obscured by a lot of (terminal) noise.

A trick to get nonetheless the ML trees despite obscuring noise is to prune down the full tip set to a subset of meaningful placeholders following the general rule in phylogenetics: the fewer tips, the cleaner the trees. Also the more error-prone, so knowing the phylogenetic groups a priori to selecting proper placeholders is critical. The starting tree samples you generated can serve as a basis to do this.

First run a consensus network of your starting tree samples, excluding those pre-plateau with a high cut-off of 25 or 33, i.e. only edges are kept where the corresponding branch is found in at least every 4th, 3rd of the quasi-plateau starting trees. This will give you the tips in your set that just mess-up (leading to increasing inference time; represented as collapsed fans in the consensus network) while showing the main (signal-wise trivial) groups informed by each gene-region and the corresponding splits (branches).

Then, you just go back to the highest scoring starting tree already generated to pick one (or two) placeholders for each gene's main group, e.g. the shortest and longest-branched in each group. Or just randomly pick one for the group. Idea is to get a tip set that covers all aspects in the consensus network.

You do this for all your genes, and will end up with a smaller and more tree-like (signal-wise) concatenated matrix. There may be different placeholders to keep for each gene due to different groupings, so the number will still be comprehensive enough (i.e. you keep the entire data from each gene-selected placeholder for the concatenated matrix)

 You then can easily infer sensible trees and branch support, gene wise and for the concatendated matrix. This gives you a phylogenetic backbone that will capture all deeper aspects a most-optimal all-tip tree would show.

If you need specifically to know where the filtered tips place (ML-wise), EPA is always a quick option. If you need more resolution, use subsets limited to the data of a group (you then have the required ML trees and only use the networks for ingroup differentiation/details, which should be possible)

This would allow you to have your entire data phylogenetically placed/grouped using ML, without having to bother about the topological uncertainties of a (possibly inevitably) noisy 500 x 20000 tree (did one use enough starting trees or not?). And you can run all analyses on a stand-alone machine (in fact, I ran larger matrices on a tetra-core PC using RAxML, took only a few days using the duplicates-eliminated matrices: but it of course depends how coherent the signal in the individual gene partitions is)

Important application note: Given your gene sample (two coding plastid genes mostly resolving only mid-/deep splits, a noncoding plastid barcode working only in tropical and high-divergent herbaceous plants; I suppose the L... are single-copy nuclear genes, so may generally tell a different story), in addition to tree-obscurity due to lack of discriminating signal – pending what hierarchical level you're looking at, you actually may have ancestors-descendant pairs in addition to random 3rd base pair noise leading to very flat, ambigous subtrees – there may simply be no universal tree for the concatenated matrix at all.

/G.

PPS Together with Markus Göker, Alexi and I once tested the effect of group-based filtering for all availaible (at the time) rbcL data on angiosperms (open access), which can be used as a theoretical basis for why to filter at all: by filtering group-wise, we saw not only an increase in branch support, but also tree resolution, ending up with trees not much different anymore from much more recent complete plastome trees, so apparently better trees than we could infer using the entire (all-tip) data (it's mostly matK; a dedicated Genealogical World post, secunded by rbcL that makes up the plastid backbone for angiosperms; and rpoC2 alone gets most of the full plastome tree, see Walker et al. 2019, PeerJ).

It's always good to assemble broad data, but for many questions, one should really experiment with an all-data backed/informed filtering to get less-tips matrices and hence trees, one can actually work with, and from which main information can then be propagate on the entire data. Filtering is only bad, if one ignores the dropped data entirely.

Alexey Kozlov

unread,
Apr 25, 2021, 8:16:59 PM4/25/21
to ra...@googlegroups.com
Dear Grimm,

huge thanks for doing the heavy-lifting here once again :)

@Sonia, I just want to emphasize that with 2k bp and 400-500 taxa, the chances are very high that
even the "perfect" ML topology (with infinite number of starting trees) will be different from the
(unknown) true topology. Or in other words, the true topology will not have the highest ML score. We
can see this in simulations. Hence, there is no benefit in running 1000s of searches: what looks
like a small likelihood improvement, is most probably just random noise.

Best,
Alexey

On 23.04.21 11:41, Grimm wrote:
> Hi Sonia,
>
> looking at the skylines in the spread sheet – which in themselves are very intriguing, as the mimick
> Bayesian skylines and proof Alexi's point on how important more than 10 starting trees are for
> particular datasets to get the best-possible topology, possibly speeding also the final
> optimisation; you should keep them for the supplement – I'll predict (boldly) that some of your gene
> trees samples have not yet met the convergence criterion Alexi mentioned; specifically trnLF, L73,
> L194, L373 and L382
>
> However, since your objective is to get ML trees, I would report this (criterion met or not after
> 1000 starting trees > signal issue leading to inherent topological ambiguity) to illustrate the
> basic issue with your matrix but not bother to go higher, because it may be a lot to go: note, even
> your plateaus are slightly inclined.
>
> A reason for the only slight inclination maybe that the first tree of the flattened curve and the
> last (which may be 1001th or 1 millionth) before reaching stationarity only differ by quite
> irrelevant branches given your data's signal but already agree in major aspects.But they also show
> *Important application note: *Given your gene sample (two coding plastid genes mostly resolving only
> mid-/deep splits, a noncoding plastid barcode working only in tropical and high-divergent herbaceous
> plants; I suppose the L... are single-copy nuclear genes, so may generally tell a different story),
> in addition to tree-obscurity due to lack of discriminating signal – pending what hierarchical level
> you're looking at, you actually may have ancestors-descendant pairs in addition to random 3rd base
> pair noise leading to very flat, ambigous subtrees – there may simply be no universal tree for the
> concatenated matrix at all.
>
> /G.
>
> PPS Together with Markus Göker, Alexi and I once tested the effect of group-based filtering for all
> availaible (at the time) rbcL data on angiosperms <http://dx.doi.org/10.4137/EBO.S4528> (open
> access), which can be used as a theoretical basis for why to filter at all: by filtering group-wise,
> we saw not only an increase in branch support, but also tree resolution, ending up with trees not
> much different anymore from much more recent complete plastome trees, so apparently better trees
> than we could infer using the entire (all-tip) data (it's mostly /matK/; a dedicated Genealogical
> World post <https://phylonetworks.blogspot.com/2019/10/why-emperor-has-no-clothes-on-mighty.html>,
> secunded by /rbcL /that makes up the plastid backbone for angiosperms; and /rpoC2/ alone gets most
> of the full plastome tree, see Walker et al. 2019, /PeerJ/ <https://peerj.com/articles/7747>).
> www.exelixis-lab.org <http://www.exelixis-lab.org>
>
> --
> You received this message because you are subscribed to the Google Groups "raxml" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to
> raxml+un...@googlegroups.com <mailto:raxml+un...@googlegroups.com>.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/raxml/a72e8604-0152-435d-be06-43be774b7638n%40googlegroups.com
> <https://groups.google.com/d/msgid/raxml/a72e8604-0152-435d-be06-43be774b7638n%40googlegroups.com?utm_medium=email&utm_source=footer>.
Reply all
Reply to author
Forward
0 new messages