Suspected LBA but high bootstrap value

Kenta Renard

May 30, 2024, 1:31:17 PM

to raxml

Dear All,

Thank you, Guido and Alexis, for the replies and help with my recent questions; I really appreciate it. I just wanted to ask a few questions about some results I had.

What are the mechanisms behind incorrectly inferred taxonomic relationships (driven by branch attraction to distant taxa) if the support for the false branch is high? I ask this because of 2 situations I had:

1) A very long branch was located in the correct clade but in a more internal node (branch support of 93%). Removal and placement with EPA-ng placed it as a lone sister taxon to the clade it was originally inferred to be in (100% LWR) and the branch length was halved. What is the true topology here?

2) 8 taxa known to be a sub-clade within a bigger one (let's say clade A) were incorrectly inferred to be part of clade B (branch support of 99%) despite the short branch lengths. Removing the 8 taxa and placing them back with EPA-ng produced the correct/known topology (average LWR of around 85%, but one with 54%). I am re-inferring a tree with more taxa from clade B (to try to remove attraction artifacts), but I cannot increase the sampling of the group the 8 taxa belong to. Apart from poor sampling of the outgroup (clade B), what else could contribute to this attraction?

My understanding is that topological inconsistencies between a set of inferred ML trees (and therefore branch supports) should show in the support values.

The over-arching questions are:

1) What causes these phenomena of incorrect branch attraction when the branch supports are still high?

2) To what extent can "a posteriori" tip-pruning yield an accurate tree (how deep should I prune)? Or is it better to remove the taxa from the alignment and re-infer?

Thank you again for your help.

Best wishes,

Kenta

Alexandros Stamatakis

May 31, 2024, 2:35:56 AM

to ra...@googlegroups.com

Dear Kenta,

> 1) A very long branch was located in the correct clade but in a more
> internal node (branch support of 93%). Removal and placement with EPA-ng
> placed it as a lone sister taxon to the clade it was originally inferred
> to be in (100% LWR) and the branch length was halved. What is the true
> topology here?

We don't know the true one, but you do seem to have a gut feeling about
which one is the most biologically plausible. It looks as if here, as
with distant outgroups, the sequence sitting on this long branch
perturbs the ingroup inference. So I would remove this sequence, infer a
tree on the remaining sequences and then place the sequence onto the
tree with EPA. I would definitely report what happened in the paper you
will write about this, as this can be a very valuable dataset for
analyzing this problem.
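
A minimal Python sketch of the first step of this workflow (taking the long-branch sequence out of the alignment so that the tree can be inferred without it), assuming a plain FASTA alignment. The function names and the taxon names in the demo are made up for illustration. The query keeps its original alignment columns, which placement tools such as EPA-ng need in order to match the query against the reference alignment.

```python
def read_fasta(text):
    """Parse FASTA text into a dict of name -> aligned sequence (input order kept)."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # use the first word of the header as the name
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def split_alignment(records, query_names):
    """Split an alignment into reference sequences and query sequences."""
    query_names = set(query_names)
    missing = query_names - set(records)
    if missing:
        raise KeyError(f"not in alignment: {sorted(missing)}")
    reference = {n: s for n, s in records.items() if n not in query_names}
    query = {n: s for n, s in records.items() if n in query_names}
    return reference, query


def to_fasta(records):
    """Serialise a dict of name -> sequence back to FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records.items())


# Hypothetical toy alignment; real data would be read from a file.
alignment = """>taxon_a
ACGT-ACGTA
>taxon_b
ACGTTACGTA
>taxon_c
ACGTTACGAA
>long_branch_taxon
TCGATACCTA
"""

records = read_fasta(alignment)
reference, query = split_alignment(records, ["long_branch_taxon"])
print(to_fasta(reference), end="")  # goes to the tree search
print(to_fasta(query), end="")      # goes to the placement step
```

The reference file would then go into the ML tree search, and the reference alignment, the resulting tree and the query file into the placement run. The exact command-line options are best taken from each program's own help.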

> 2) 8 taxa known to be a sub-clade within a bigger one (let's say clade A)
> were incorrectly inferred to be part of clade B (branch support of 99%)
> despite the short branch lengths. Removal of the 8 taxa and placing
> them back with EPA-ng produced the correct/known topology (average LWR
> of around 85% but one with 54%). I am re-inferring a tree with more taxa
> from clade B (to try and remove attraction artifacts) but I cannot
> increase the sampling of the group the 8 taxa belong to. Apart from poor
> sampling of the outgroup (clade B), what else could contribute to this
> attraction?

Hard to tell. How many ML trees have you inferred? Do these 8 taxa
always end up in clade B in all ML trees you have inferred?
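
One way to check this across a set of ML trees is to count in how many of them the 8 taxa form a group with clade A versus clade B. Below is a minimal Python sketch under several assumptions: the trees are plain Newick strings without quoted labels, the trees are treated as unrooted (a group counts if some branch separates it from all other taxa), and all taxon names and function names are hypothetical.

```python
import re


def leaf_sets(newick):
    """Return (all leaf names, leaf set below every parenthesised node)."""
    tokens = re.findall(r"[(),;]|[^(),;]+", newick)
    stack, clades, leaves = [], [], set()
    prev = None
    for tok in tokens:
        if not tok.strip():
            continue  # ignore whitespace between symbols
        if tok == "(":
            stack.append(set())
        elif tok == ")":
            done = stack.pop()
            clades.append(frozenset(done))
            if stack:
                stack[-1] |= done
        elif tok not in (",", ";") and prev != ")":
            # A label that does not follow ")" is a leaf; drop its branch length.
            name = tok.split(":")[0].strip()
            leaves.add(name)
            if stack:
                stack[-1].add(name)
        # Labels directly after ")" are support values / branch lengths: skipped.
        prev = tok
    return frozenset(leaves), clades


def groups_together(newick, taxa):
    """True if some branch of the (unrooted) tree separates `taxa` from the rest."""
    leaves, clades = leaf_sets(newick)
    taxa = frozenset(taxa)
    if not taxa <= leaves:
        raise ValueError(f"taxa not in tree: {sorted(taxa - leaves)}")
    return taxa in clades or (leaves - taxa) in clades


def count_trees_with_group(trees, taxa):
    """Number of trees in which `taxa` form one side of a branch."""
    return sum(groups_together(t, taxa) for t in trees)


# Hypothetical toy trees: a* = clade A, b* = clade B, x* = the problem taxa, o* = others.
ml_trees = [
    "(((a1,a2),(x1,x2)),(b1,b2),(o1,o2));",
    "((a1,a2),((b1,b2),(x1,x2)),(o1,o2));",
    "((a1:0.1,a2:0.1)95:0.2,((b1:0.1,b2:0.1)80:0.1,(x1:0.1,x2:0.1)99:0.1)99:0.3,(o1:0.1,o2:0.2)70:0.1);",
]
with_a = count_trees_with_group(ml_trees, {"a1", "a2", "x1", "x2"})
with_b = count_trees_with_group(ml_trees, {"b1", "b2", "x1", "x2"})
print(f"grouped with clade A in {with_a} of {len(ml_trees)} trees")
print(f"grouped with clade B in {with_b} of {len(ml_trees)} trees")
```

If the taxa end up in clade B in every independent ML search, the placement is at least not a result of a single search getting stuck; if the trees disagree, it is worth comparing their likelihoods.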

> My understanding is that topological inconsistencies between a set of
> inferred ML trees (and therefore branch supports) should show in the
> support values.

Yes, if there are inconsistencies, but you should also check the
distinct ML trees and what they show. Further, you may want to do a
constrained ML search with the grouping you think is correct and compare
ML scores of the resulting trees via standard phylogenetic likelihood
significance tests.
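
A topological constraint is usually given as a multifurcating Newick tree in which only the grouping to be enforced is resolved. The Python sketch below writes such a tree for one constrained group; the function name and taxon names are hypothetical, and how the file is passed to the tree search (and which likelihood test, e.g. SH or AU, is then run on the constrained versus unconstrained trees) depends on the programs used, so check their documentation.

```python
def constraint_newick(constrained_group, other_taxa):
    """Build a Newick constraint tree that resolves only one grouping.

    All taxa in `constrained_group` are forced into one clade; the
    remaining taxa are left unresolved next to it.
    """
    group = sorted(constrained_group)
    others = sorted(other_taxa)
    if len(group) < 2:
        raise ValueError("need at least two taxa in the constrained group")
    if len(others) < 2:
        raise ValueError("need at least two taxa outside the constrained group")
    overlap = set(group) & set(others)
    if overlap:
        raise ValueError(f"taxa listed on both sides: {sorted(overlap)}")
    return "((" + ",".join(group) + ")," + ",".join(others) + ");"


# Hypothetical names: force the problem taxa (x*) together with clade A (a*).
clade_a_with_problem_taxa = {"a1", "a2", "x1", "x2"}
everything_else = {"b1", "b2", "o1", "o2"}
print(constraint_newick(clade_a_with_problem_taxa, everything_else))
```

The best tree from the constrained search and the best unconstrained tree can then be compared on the same alignment and model.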

> The over-arching questions are:
>
> 1) What causes these phenomena of incorrect branch attraction when the
> branch supports are still high?

Maybe the long branches still maximize the likelihood?

> 2) To what extent can "a posteriori" tip-pruning yield an accurate tree
> (how deep should I prune)? Or is it better to remove the taxa from
> the alignment and re-infer?

100% re-infer :-)

Alexis

--
Alexandros (Alexis) Stamatakis

ERA Chair, Institute of Computer Science, Foundation for Research and
Technology - Hellas
Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology

www.biocomp.gr (Crete lab)
www.exelixis-lab.org (Heidelberg lab)

Grimm

May 31, 2024, 2:15:01 PM

to raxml

Hi Kenta,

what Alexis wrote regarding the questions :) Just a few more explanations, from the viewpoint of someone applying these methods to biological data, regarding the two topological phenomena you're facing:

Re Phenomenon 1) Assuming the long-branched tip is the only one in its clade, LBA is an unlikely explanation for its internal placement in the inclusive tree: under LBA, it should be attracted by the rest of the tree (the relative "outgroup") and drawn away from its siblings (the relative ingroup). Depending on the actual signals in the underlying data, EPA may in some cases fall for the LBA while the inclusive ML tree escaped it when optimising the clade's structure. It depends a lot on the genetic coherence of the group it associates with.

A thing that doesn't fit easily into this explanation is the halved branch length in EPA. I don't know the topology of the subtree (are the remaining branches highly supported or not when the long-branched tip is included? How coherent is the clade with and without the long-branched tip?), but it may be that the tip branch is inflated in the inclusive tree because you forced somewhat non-treelike signal into a tree. Depending on the gene-wise signal of the distinct tip, the inclusive ML-tree inference may prefer a placement on an internal branch that fits both the majority and the minority signal, while EPA solves this problem by placing the tip outside, because the likelihoods worsen when it is put anywhere else: it doesn't really fit any other tip in the subtree. Based on the EPAs I did, EPA tends to place tips with ambivalent signals basally rather than splitting the LWRs. Only distinct F1 hybrids may get LWRs split between their two differentiated parents.

Note that signal conflict shows up easily in the BS supports only when using oligogene matrices based on genes with strong signals (for a hybrid, if you have two genes, each showing one parentage with unambiguous support, the 2-gene tree has a good chance of a ~50:50 split in the BS support); in multigene trees it is easily watered down. For instance, in the beeches, where I looked at a set of 28 nuclear genes, 3 genes supported a profoundly different phylogeny and rejected the coalescent and combined tree, capturing ancient introgression from one lineage into a cousin lineage (Supplement Data S5 to Cardoni et al. 2021). These 3 genes were completely knocked out in the combined tree: in the 28-gene tree, all branches received unambiguous support. Whether conflict surfaces in BS supports always depends on the signal from the conflicting genes/partitions, the number of perfectly behaving and roguish tips, and which branches are in conflict.

In the case of the beech, the sister relationship unambiguously supported by those three genes conflicts with the entire backbone of the combined tree (= coalescent) and literally all sorted character splits in the remaining 25 genes. So, when bootstrapping the total data set, the chance of retaining enough of the alignment patterns that support the alternative, and hence of getting a corresponding pseudoreplicate topology, is ~0.
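
The watering-down effect can be illustrated with a deliberately crude toy simulation in Python. It assumes that every informative site casts one independent "vote" for either the majority or the minority topology, that every gene contributes 50 such sites, and that a pseudoreplicate shows the minority topology only when minority sites outnumber majority sites after resampling. Real ML bootstrapping is far more complex, and the site counts and function name are invented for illustration.

```python
import random


def minority_bootstrap_frequency(n_minor, n_major, replicates=1000, seed=1):
    """Fraction of bootstrap replicates in which minority sites outnumber majority sites.

    Sites are resampled with replacement, so each resampled site is a
    minority site with probability n_minor / (n_minor + n_major).
    """
    rng = random.Random(seed)
    n_sites = n_minor + n_major
    p_minor = n_minor / n_sites
    wins = 0
    for _ in range(replicates):
        minor = sum(1 for _ in range(n_sites) if rng.random() < p_minor)
        if minor > n_sites - minor:
            wins += 1
    return wins / replicates


sites_per_gene = 50  # assumed number of informative sites per gene

# Two genes, one supporting each parentage.
two_gene = minority_bootstrap_frequency(1 * sites_per_gene, 1 * sites_per_gene)

# 28 genes, 3 of them supporting the alternative topology.
multigene = minority_bootstrap_frequency(3 * sites_per_gene, 25 * sites_per_gene)

print(f"2-gene matrix:  minority topology in {two_gene:.1%} of replicates")
print(f"28-gene matrix: minority topology in {multigene:.1%} of replicates")
```

Under these assumptions the two-gene matrix gives a roughly even split, while in the 3-versus-25 case the minority signal essentially never wins a replicate, so the conflict leaves no trace in the support values.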

A hint of non-treelike signal (rogue tips) is the BS of all branches in the corresponding subtree: if they generally decrease when the tip is included but become unambiguous when it's excluded, then, in the light of the EPA result, this is a strong indication that this long-branched tip is a bit roguish because its signal doesn't fit the assumption of strictly dichotomous evolution that we model when inferring a tree. In short, the problem you face may not be LBA but non-treelike evolution, or non-treelike signal in the data induced by the long-branched tip (evolution not involving introgression/hybridisation may also generate non-treelike signals!).
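
A small Python sketch of this comparison, assuming the bootstrap supports of both analyses have already been extracted into dictionaries that map each clade (as a set of tip names, relative to the same rooting) to its support. The data layout, function name and numbers are assumptions for illustration only.

```python
def support_shift(with_tip, without_tip, tip):
    """Change in support (without minus with) for clades shared by both trees.

    Clades from the inclusive tree are matched to the pruned tree after
    removing `tip` from their tip set. If several inclusive clades collapse
    onto the same pruned clade, the largest shift (weakest inclusive
    support) is reported.
    """
    shifts = {}
    for clade, bs_with in with_tip.items():
        reduced = frozenset(clade) - {tip}
        if len(reduced) < 2:
            continue  # nothing left to compare
        bs_without = without_tip.get(reduced)
        if bs_without is None:
            continue  # clade not present in the tree without the tip
        shift = bs_without - bs_with
        shifts[reduced] = max(shifts.get(reduced, float("-inf")), shift)
    return shifts


# Hypothetical supports (%) for a subtree, with and without tip "T".
inclusive = {
    frozenset({"A", "B"}): 55,
    frozenset({"A", "B", "T"}): 60,
    frozenset({"C", "D"}): 62,
}
pruned = {
    frozenset({"A", "B"}): 100,
    frozenset({"C", "D"}): 98,
}

for clade, shift in support_shift(inclusive, pruned, "T").items():
    print(sorted(clade), f"{shift:+d}")
```

Consistently large positive shifts across the subtree would match the roguish-tip pattern described above.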

Re phenomenon 2) I would need to see the tree's topology, but I guess it's either what I call "short-branch culling" in the course of a rooting (outgroup-induced) artefact, or you misinterpreted the support values because of e.g. the Newick-inherent glitch; i.e. your analysis didn't actually "incorrectly infer" them as part of clade B, but the error lies in how we interpret that branch support of 99%. Given the EPA results, I tend towards the latter. You will see directly, once you have re-inferred with the larger B sample, whether the issue remains with the larger outgroup.

Cheers, Guido

Kenta Renard

Jun 3, 2024, 11:23:25 AM

to ra...@googlegroups.com

Dear Guido and Alexis,

Thank you for the very helpful answers as usual.

> Only distinct F1-Hybrids may get split LWRs with their two parental and differentiated parents.

Regarding this, my sequences are prokaryotic so I'm not sure this could be the reason. In this case, what could be causing split LWRs?

> Re phenomenon 2) I would need to see the tree's topology, but I guess it's either what I call "short-branch culling" in the course of a rooting (outgroup-induced) artefact, or you misinterpreted the support values because of e.g. the Newick-inherent glitch; i.e. your analysis didn't actually "incorrectly infer" them as part of clade B, but the error lies in how we interpret that branch support of 99%. Given the EPA results, I tend towards the latter. You will see directly, once you have re-inferred with the larger B sample, whether the issue remains with the larger outgroup.

Regarding this, I found that if the outgroup contained certain taxa, the troublesome 8 taxa would be attracted to them even though we know this is biologically incorrect. Interestingly, the known outgroup sequence the 8 taxa were attracted to belongs to the same genus as the 8 taxa (say genus A), i.e. ingroup sequences are attracted to an outgroup sequence belonging to the same genus. If the outgroup sample size is reduced (and doesn't include genus A), then the 8 taxa are inferred to be within the biologically correct clade. I will try inferring ML trees with the same alignment but without the genus A outgroup sequence and see if the 8 taxa are still attracted to the outgroup.

Best wishes and thanks again,

Kenta

Grimm

Jun 3, 2024, 11:42:47 AM

to raxml

Hi Kenta,

in case of prokaryotic data, split LWRs may have different reasons:

- Uncertainty, i.e. lack of discerning signal regarding a local aspect of the topology, e.g. a fast ancient radiation or flat terminal subtrees. In this case, the LWR will be split between nearby internodes or internodes of the same clade.
- Pseudogeny (degraded sequences). Here the LWR can be split between tips that are not directly related or internodes that are not neighbours.
- Quasi ancestor-descendant situations. These are easy to spot because an ancestral (or less evolved) query often has its LWR split between the root and the subsequent tip/subtree including its descendants (more evolved siblings of the reference).
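
For reference, the LWR of a candidate branch is the likelihood of the tree with the query placed on that branch, divided by the sum of these likelihoods over all candidate branches. A minimal Python sketch computing LWRs from placement log-likelihoods and flagging a split placement is given below; the branch labels, the log-likelihood values and the 0.9 cut-off are assumptions chosen for illustration.

```python
import math


def likelihood_weight_ratios(log_likelihoods):
    """Convert per-branch placement log-likelihoods into LWRs that sum to 1."""
    best = max(log_likelihoods.values())
    # Subtract the best value before exponentiating to avoid numerical underflow.
    weights = {b: math.exp(lnl - best) for b, lnl in log_likelihoods.items()}
    total = sum(weights.values())
    return {b: w / total for b, w in weights.items()}


def is_split_placement(lwrs, threshold=0.9):
    """True if no single branch reaches the (arbitrary) LWR threshold."""
    return max(lwrs.values()) < threshold


# Hypothetical log-likelihoods of one query placed on three candidate branches.
placements = {"branch_12": -10000.0, "branch_13": -10000.0, "branch_40": -10010.0}
lwrs = likelihood_weight_ratios(placements)
for branch, lwr in sorted(lwrs.items(), key=lambda item: -item[1]):
    print(f"{branch}: LWR = {lwr:.4f}")
print("split placement:", is_split_placement(lwrs))
```

Which branches share the weight (neighbouring internodes, unrelated tips, or the root plus a descendant subtree) is what distinguishes the three situations listed above.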

Regarding the Misplaced-8, yes, that's a sensible way to look into this. It's quite common that only some members of an outgroup trigger ingroup-outgroup attraction. Less can be more, also in phylogenetics 😉

/G
