Low bootstrap values and gaining confidence in a topology


Yoav Shoshan

Jan 7, 2021, 6:02:27 AM
to raxml
Hello,

My issue may appear very basic; reconstructing trees is far from my daily focus.

I'm using raxml-ng to reconstruct the tree topology of 8 mollusk species.
I have a large MSA file containing the concatenation of the MSAs of several thousand gene groups (and a similar file for the protein sequences).

When reconstructing the tree with raxml-ng default parameters, I get one distinct topology from 60 bootstrap trees (20 for the LG model, 20 for WAG, and 20 for GTR+G with the nucleotide MSA); the RF distance is 0 for all 60.
Also, all resulting bootstrap trees for each of those 3 runs end up with the same likelihood.
The bsconverge option on one of the best trees shows convergence after 50 replicates with a bs-cutoff of 0.1.

However, I get very low support values on all of the branches of the best trees (which are identical to all other trees, not considering the root location).
I admit that I haven't quite figured out the meaning of the numbers. I was expecting something like the number of bootstrap trees (out of 100) that supported the branch.
Here, I have numbers usually between 0 and 1 (not always! What is the range?), and I guess they represent something like the tendency of the branch to "remain the same" when resampling columns from the MSA during bootstrapping(?). I would be glad to have a clarification of the definition of those numbers.

I should mention that the genes were assembled de novo with Trinity (without a reference genome to align the RNA-seq data to). This means that many of the genes are actually incomplete fragments, which causes the MSA to open many large gaps.
If this can cause a problem, can I just restrict my analysis to MSA locations without many large gaps (say, by removing all columns that have X neighboring columns with gaps within a window Y columns wide, or something of that sort)?

I have asked many questions so far, but I guess the main, most important one is how to gain a measure of confidence in the topology (what is the next best topology? How does it do with respect to this one? etc.).


My best trees for the WAG and GTR+G models:
WAG:
(((squ:0.060366,sep:0.057021):0.015237,((oct:0.008162,bim:0.013167):0.187903,(nau:0.236670,apl:0.475046):0.156231):0.123248):0.016502,lin:0.072651,bob:0.086923);  

GTR+G:
(((apl:1.209684,nau:0.449381):0.343308,((bob:0.135055,lin:0.100079):0.023385,(sep:0.079225,squ:0.087726):0.019226):0.227576):0.366793,oct:0.008871,bim:0.015477);  

Best,
Yoav

Grimm

Jan 7, 2021, 7:28:44 AM
to raxml
Hej Yoav,

The < 1 values you are looking at are the branch lengths, not the branch support values. What you did in your experimental set-up was "just" to infer 20 ML-optimised trees (topologies), as recommended by the makers of RAxML to ensure that your inferred tree is the optimal one. But either you didn't invoke bootstrapping to test the branch support, or you are looking at the wrong output file. The best trees you give are naked trees.

In Newick (https://en.wikipedia.org/wiki/Newick_format), branch-support (bootstrap or other) would be given like this:
(squ:0.060366,sep:0.057021)100:0.015237...
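
A quick way to see which kind of numbers a Newick string carries is to pull them out with two regular expressions. This is only a minimal Python sketch, not part of RAxML: the function name is made up, the two strings are a shortened toy version of the tree above, and it assumes that support labels are plain numbers written directly after a closing parenthesis.

```python
import re

def newick_numbers(newick: str):
    """Return (branch_lengths, support_values) found in a Newick string."""
    # Branch lengths follow a colon, e.g. ":0.015237".
    lengths = [float(x) for x in
               re.findall(r":(\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)", newick)]
    # Support labels sit directly after a closing parenthesis, e.g. ")100".
    supports = [float(x) for x in re.findall(r"\)(\d+(?:\.\d+)?)", newick)]
    return lengths, supports

# Shortened toy strings (assumption: 4 taxa only, for illustration).
naked = "((squ:0.060366,sep:0.057021):0.015237,lin:0.072651,bob:0.086923);"
labelled = "((squ:0.060366,sep:0.057021)100:0.015237,lin:0.072651,bob:0.086923);"

print(newick_numbers(naked)[1])     # [] : a naked tree has no support labels
print(newick_numbers(labelled)[1])  # [100.0]
```

If the second list comes back empty, the file holds branch lengths only, and the values below 1 are not supports.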

Classic RAxML produced three output trees in a full analysis:
  1. The naked tree that you are looking at: RAxML_besttree_...
  2. A tree with bootstrap support mapped: RAxML_bipartitions_...
  3. A tree with bootstrap support mapped and "shadow" nodes to avoid BS support "jumping" when re-rooting the tree: RAxML_bipartitionsBranchLabelled_...
(The names have changed, but with RAxML-NG you still get the naked tree, without support values, and the branch-labelled tree, i.e. with support values.)


Regarding your gap question: ML is relatively robust to missing data. As long as there are no homology issues and everything is fairly well aligned, you don't need to filter (not with your very small data set).

Cheers, Guido

Yoav Shoshan

Jan 7, 2021, 8:44:09 AM
to raxml
Thank you Grimm!

I forgot to mention that I did execute raxml-ng with the --bootstrap option, yielding 50 bootstrap trees for each of the 3 models.
Going back to raxml-ng with the --support option to compute bipartition support, using each of the 3 best trees as a reference tree, I get (maybe something like file 2 in your list?):

LG tree:
(((bim:0.013404,oct:0.008383)100:0.198256,(nau:0.251934,apl:0.514292)100:0.166518)100:0.129879,(lin:0.075211,bob:0.089988)100:0.017154,(squ:0.062265,sep:0.058819)100:0.015709);

WAG tree:
(((squ:0.060366,sep:0.057021)100:0.015237,((oct:0.008162,bim:0.013167)100:0.187903,(nau:0.236670,apl:0.475046)100:0.156231)100:0.123248)100:0.016502,lin:0.072651,bob:0.086923);

GTR+G tree:
(((apl:1.209684,nau:0.449381)100:0.343308,((bob:0.135055,lin:0.100079)100:0.023385,(sep:0.079225,squ:0.087726)100:0.019226)100:0.227576)100:0.366793,oct:0.008871,bim:0.015477);

Meaning all the inner branches are 100% supported(?) 

I hope I understand correctly so far.
If so, would you say this analysis (maybe with larger sets of bootstrap trees and a more careful look into the MSA to consider filtering) would be sufficient to support the resulting topology? 

Thanks again,
Yoav

Grimm

Jan 7, 2021, 8:55:45 AM
to raxml
Absolutely right-on. Congratulations, you have a fully supported, fully resolved tree with what I would always just call "unambiguous support". That is the expectation when combining many characters and few tips. In fact, anything else would have been a very red warning flag.

" If so, would you say this analysis (maybe with larger sets of bootstrap trees and a more careful look into the MSA to consider filtering) would be sufficient to support the resulting topology? "

Since the 20 trees are identical (topologically), and amply scoring, and their branching patterns are unambiguously supported: absolutely. You can't do better: there is only one possible tree for your data set, and you found it.

Re: "larger set of BS tree": one always tries the implemented bootstop criterion to ensure a sufficient number of BS replicates behind the values, but in your case, it probably would stop with 50 anyway.

Yoav Shoshan

Jan 7, 2021, 9:03:54 AM
to raxml
Thank you very much for the quick responses and the cheering words! 
Yoav

Lucas Czech

Jan 7, 2021, 1:18:21 PM
to ra...@googlegroups.com

Dear Yoav,

just to add my two cents here: Guido (Grimm) already mentioned how branch support values can be added to Newick tree files. However, the Newick format did not originally intend to have such extra annotations, and so this is more of a hack - and to make it more complex, there are different variants of that hack. Hence, when visualizing the tree with support values on it, you need to be aware of which hack was used - and in particular, when you re-root the tree, errors can occur when this is not taken into account. See here for details: https://academic.oup.com/mbe/article/34/6/1535/3077051

That being said, as you have "unambiguous support" in your tree, not much can go wrong, as all the support values are the same anyway. But still, I figured that more people should be aware of the issue, to prevent future mistakes ;-)

Cheers and all the best
Lucas


Grimm

Jan 7, 2021, 2:10:58 PM
to raxml
One cannot repeat that and point to that article often enough.

There are still many who misinterpret bootstrap values (or posterior probabilities) as "node support".

Which is what the NEWICK hack Lucas mentioned does. It transfers a branch (internode) support, the frequency of a taxon bipartition, to a node in an implicitly rooted representation of the unrooted tree we actually infer. And since NEWICK is older than phylogenetics, it can only encode rooted trees instead of giving the sets of taxon bipartitions (splits) that make up the tree. As a consequence, the estimated branch support always counts for two nodes we could encode in the NEWICK format.
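
To make "frequency of a taxon bipartition" concrete, here is a small Python sketch. It is an illustration only: the five taxa, the four replicate trees and the helper names are all made up, and each replicate is simply written down as a list of its non-trivial bipartitions rather than read from a tree file.

```python
from collections import Counter

def canonical(side, taxa):
    """Represent a bipartition by a frozenset holding both of its sides."""
    side = frozenset(side)
    return frozenset({side, frozenset(taxa) - side})

taxa = {"A", "B", "C", "D", "E"}

# Toy bootstrap sample (hypothetical data): every replicate tree is given by
# one side of each of its two non-trivial bipartitions.
replicates = [
    [{"A", "B"}, {"D", "E"}],
    [{"A", "B"}, {"C", "D"}],
    [{"C", "D", "E"}, {"D", "E"}],  # {C,D,E} is the same split as {A,B}
    [{"A", "C"}, {"B", "D"}],
]

counts = Counter(canonical(s, taxa) for tree in replicates for s in tree)

def support(side):
    """Percentage of replicates that contain the bipartition side | rest."""
    return 100.0 * counts[canonical(side, taxa)] / len(replicates)

print(support({"A", "B"}))  # 75.0 for AB|CDE
print(support({"D", "E"}))  # 50.0 for DE|ABC
```

Note that the value belongs to the split AB|CDE as a whole; asking for `support({"C", "D", "E"})` gives the same 75.0, which is exactly why it cannot be tied to a single node.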

[Attached image: BasicsNodesAndStuff.png]

To transfer the inferred tree into NEWICK, we need to assume a root.
E.g. (((A, B), C), D)
With support
(((A, B)99, C), D)
But this is the same as if we would write
((A, B)99, (C, D)99)   or   (A, (B, (C, D)99))
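
That these rootings describe the same tree can be checked by reducing each string to its set of non-trivial bipartitions. The following is a minimal Python sketch under simple assumptions (the `splits` helper is hypothetical, taxon names contain no parentheses, commas or whitespace, and labels after a closing parenthesis are ignored); it is not how any particular tree viewer does it.

```python
import re

def splits(newick: str):
    """Return the non-trivial bipartitions of a Newick tree as a set."""
    tokens = re.findall(r"[(),;]|[^(),;\s]+", newick)
    stack, clades, taxa, prev = [], [], set(), None
    for tok in tokens:
        if tok == "(":
            stack.append(set())          # open a new clade
        elif tok == ")":
            clade = stack.pop()          # clade is complete
            clades.append(frozenset(clade))
            if stack:
                stack[-1] |= clade       # its leaves also belong to the parent
        elif tok not in ",;" and prev != ")":
            name = tok.split(":")[0]     # drop a branch length, if any
            taxa.add(name)
            stack[-1].add(name)
        # tokens right after ")" are internal labels (supports): skipped
        prev = tok
    all_taxa = frozenset(taxa)
    return {
        frozenset({c, all_taxa - c})
        for c in clades
        if 2 <= len(c) <= len(all_taxa) - 2
    }

a = splits("(((A, B)99, C), D);")
b = splits("((A, B)99, (C, D)99);")
c = splits("(A, (B, (C, D)99));")
print(a == b == c)  # True: all three encode only the split AB|CD
```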

Hence the "shadow" nodes in the original bipartitionsBranchLabelled RAxML output trees: for the picture example, we place a shadow node in the middle of the pink branch and link the 99 to that node, which allows re-rooting, e.g. in Dendroscope, without risking misplacing the values.

Note that since this additional node is only the contact point of two internodes (the upper and lower half of the pink internode) and not three like the actual nodes 1 and 2, not all tree readers can interpret the bipartitionsBranchLabelled NEWICK file. (Dendroscope can, because it has also been designed for explicit phylogenetic networks, which include nodes connecting four internodes as well as direct ancestor-descendant pairs; imagine B not as a tip but placed at node 1.)

Yoav Shoshan

Jan 9, 2021, 11:24:36 AM
to raxml
Thank you Lucas and Guido for that clarification.
I will take your recommendation for those viewers.

But just to make sure I have everything clear, taking the Newick format of my WAG tree as an example:
(((squ,sep)100,((oct,bim)100,(nau,apl)100)100)100,lin,bob);
  
In my case, there is a firm consensus that the root lies on the branch connecting apl to its ancestral node. The rooted Newick tree is now:
(apl,(nau,((oct,bim)100,((sep,squ)100,(bob,lin)100)100)100);

Is this correct?
I assume that since the support values are 100  per each node (actually the branch connecting it to its neighbor node in the direction of the trifurcation), the visualized rooted tree will have "100" on each inner branch, and I wouldn't have to worry about transforming the support values on the branch now connecting the root to the subtree (nau,((oct,bim)100,((sep,squ)100,(bob,lin)100)100);

Best,
Yoav

Lucas Czech

Jan 9, 2021, 1:35:57 PM
to ra...@googlegroups.com

Dear Yoav,


"the rooted Newick tree is now:
(apl,(nau,((oct,bim)100,((sep,squ)100,(bob,lin)100)100)100);
Is this correct?"

Not quite, there seems to be a parenthesis missing at the end. How did you obtain this tree? I would not recommend editing Newick trees by hand - way too error prone! But with an additional closing parenthesis at the end, that tree is equivalent to the one I get from rerooting at "apl" with Dendroscope, so it looks okay.
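
Counting parentheses is easy to automate if a string has been touched by hand. A minimal Python sketch (the function name is made up; it only checks balance, not whether the string is otherwise valid Newick), applied to the tree as posted and to the version with one more closing parenthesis:

```python
def unmatched_parentheses(newick: str) -> int:
    """Return how many '(' lack a matching ')'; raise if a ')' comes too early."""
    depth = 0
    for ch in newick:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("closing parenthesis without an opening one")
    return depth

posted = "(apl,(nau,((oct,bim)100,((sep,squ)100,(bob,lin)100)100)100);"
fixed = "(apl,(nau,((oct,bim)100,((sep,squ)100,(bob,lin)100)100)100));"

print(unmatched_parentheses(posted))  # 1
print(unmatched_parentheses(fixed))   # 0
```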


"I assume that since the support values are 100  per each node (actually the branch connecting it to its neighbor node in the direction of the trifurcation), the visualized rooted tree will have "100" on each inner branch, and I wouldn't have to worry about transforming the support values on the branch now connecting the root to the subtree (nau,((oct,bim)100,((sep,squ)100,(bob,lin)100)100);"

I am not quite sure what you mean by this. As Guido Grimm said, support values are not "per each node" - they are per branch/bipartition/split/internode.

In your tree, the clade that connects to the root (which is no longer a top-level trifurcation, since you have now rooted it) is missing its support value. But that is correct, because it connects to a single leaf node on the other side of the root ("apl"), and branches leading to single leaves usually do not have support values (their split/bipartition is included in every tree anyway, so it does not make much sense to add them). If, instead of just "apl", the other side of the root contained more taxa, you would want to have a support value there. In fact, it would have to be added to both of the branches at the root node, because now that the tree is rooted, the two branches leading away from the root are in fact one branch only, which is just drawn as two. See Fig. 2(c) of the article (https://academic.oup.com/mbe/article/34/6/1535/3077051) for an example.

Hope that helps
Lucas

Yoav Shoshan

Jan 9, 2021, 2:42:34 PM
to raxml
Yes, that helps a lot.
The tree is supposed to be closed with another parenthesis; thank you for that catch.

I'm sorry for the confusion in terminology; as I mentioned, this field is not within my daily focus.
I interpreted each support value as belonging to the branch leading from the node the value is assigned to (in the Newick tree) to the neighbor node (along the path to the trifurcation), and was probably seeking confirmation of this interpretation.

Lucas Czech

Jan 9, 2021, 4:01:31 PM
to ra...@googlegroups.com
Great, glad I could help ;-)

Yes, that sounds like the correct way to approach this: The support values in your tree are at "nodes" of the tree, but are meant to be interpreted as values for the branch that leads from that node towards the root/top-level-trifurcation.

Cheers and so long
Lucas

Grimm

Jan 11, 2021, 4:26:26 AM
to raxml
A related reading tip: because so many in phylogenetics get this wrong (and far too many supervisors, too; there was a related question on ResearchGate some time ago), I posted my answer to the question "How to interpret bootstrap values?" on my Res.I.P. blog, including very simple case graphics designed for newbies to the wonderful world of phylogenetics (or misinformed veteran systematic zoologists and botanists, who very often confuse branch support with node support).

Cheers, Guido

PS As Lucas said: never edit a NEWICK file by hand! If you want to re-root without using e.g. the shadow-node format, it's very easy to avoid any error by
  1. re-rooting the naked(!) RAxML tree (without any support values) in a tree viewer (such as Dendroscope or FigTree),
  2. exporting this re-rooted variant as NEWICK, and
  3. then just re-mapping the supports onto that tree based on your original RAxML bootstrap pseudoreplicate sample (or the Bayesian sampled topologies, for that matter) using the corresponding function in RAxML (or with fitting R packages).
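
For anyone curious what step 3 amounts to, here is a rough Python sketch of the idea. It is not the RAxML implementation: all function names are hypothetical, the trees are toy data with five taxa, labels and branch lengths are dropped on reading, and in practice one would use the mapping function of RAxML / raxml-ng (the --support option mentioned above) or an R package rather than this.

```python
import re
from collections import Counter

def parse(newick):
    """Parse Newick into nested lists of leaf names; labels and lengths are dropped."""
    stack, prev = [[]], None
    for tok in re.findall(r"[(),;]|[^(),;\s]+", newick):
        if tok == "(":
            stack.append([])
        elif tok == ")":
            node = stack.pop()
            stack[-1].append(node)
        elif tok not in ",;" and prev != ")":
            stack[-1].append(tok.split(":")[0])
        prev = tok
    return stack[0][0]

def leaves(node):
    """Set of leaf names below a node."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))

def internal_nodes(node):
    """Yield every internal node (nested list) of the tree."""
    if not isinstance(node, str):
        yield node
        for child in node:
            yield from internal_nodes(child)

def split_key(clade, taxa):
    """A bipartition is the unordered pair {clade, all other taxa}."""
    return frozenset({clade, taxa - clade})

def is_informative(clade, taxa):
    """Splits with a single leaf (or nothing) on one side carry no support."""
    return 2 <= len(clade) <= len(taxa) - 2

def remap(rooted_naked, bootstrap_trees):
    """Write the rooted tree with bootstrap percentages mapped onto its branches."""
    tree = parse(rooted_naked)
    taxa = leaves(tree)
    counts = Counter()
    for bs in bootstrap_trees:
        clades = {leaves(n) for n in internal_nodes(parse(bs))}
        counts.update({split_key(c, taxa) for c in clades if is_informative(c, taxa)})

    def write(node, is_root=False):
        if isinstance(node, str):
            return node
        text = "(" + ",".join(write(child) for child in node) + ")"
        clade = leaves(node)
        if is_root or not is_informative(clade, taxa):
            return text
        pct = round(100 * counts[split_key(clade, taxa)] / len(bootstrap_trees))
        return text + str(pct)

    return write(tree, is_root=True) + ";"

# Toy data (assumption): a re-rooted naked tree and four bootstrap replicates.
bootstraps = ["((A,B),C,(D,E));", "((A,B),C,(D,E));",
              "((A,B),E,(C,D));", "((A,B),D,(C,E));"]
print(remap("((A,B),(C,(D,E)));", bootstraps))  # ((A,B)100,(C,(D,E)50)100);
```

In the output the same value (100) appears on both branches leaving the root, because they are one and the same branch AB|CDE of the unrooted tree, as Lucas explained.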