Spurious taxon placement after removing one of 108 taxa

marybel soto gomez

Oct 7, 2019, 10:49:26 PM

to raxml

Hi all,

I am posting a question because I could not find a similar case in previous posts.

I am running an ML tree search (T = 20) for a DNA matrix that consists of 107 taxa and 80,390 bp. One taxon has an unexpected placement in the resulting best tree, and 500 bootstrap replicates show 3% support for this relationship. In contrast, the placement that I was expecting for this taxon has 97% bootstrap support. Looking at the parsimony trees for individual runs, the taxon is always in the expected placement, but it ends up in the unexpected placement in the "result.tre.RUNXX" tree from these runs.

I have used the same methods to run slightly different versions of this DNA matrix, removing 1-7 taxa to see how that affects topology. When I remove 1-6 taxa, the abovementioned taxon always goes in the expected place with high bootstrap support. It's when I remove a 7th taxon that its placement changes and is poorly supported. By the way, I'm removing the 7th taxon because it is represented by very few bp compared to most other taxa in the matrix.

I hope my question is clear. I would appreciate any thoughts anyone might have on this. Thanks in advance! All the best,

Marybel

Grimm

Oct 8, 2019, 9:07:35 AM

to raxml

Hej,

> I am running an ML tree search (T = 20) for a DNA matrix that consists of 107 taxa and 80,390 bp. One taxon has an unexpected placement in the resulting best tree, and 500 bootstrap replicates show 3% support for this relationship. In contrast, the placement for this taxon that I was expecting has 97% bootstrap support.

This points to a signal issue. Moving this taxon, the rogue, may not change the overall likelihood of the tree. Hence, RAxML just picks one tree, even if it's a suboptimal one regarding character support.

> I have used the same methods to run slightly different versions of this DNA matrix, removing 1-7 taxa to see how that affects topology. When I remove 1-6 taxa, the abovementioned taxon always goes in the expected place with high bootstrap support. It's when I remove a 7th taxon that its placement changes and is poorly supported. By the way, I'm removing the 7th taxon because it is represented by very few bp compared to most other taxa in the matrix.

The reason for this may be that the data you have for taxon 7 introduce topological conflict. For example, if taxon 7 is most similar to the rogue, the two will be joined in the tree (in the worst case because of the 3% of bootstrap replicates that do not support the preferred relationship). But if taxon 7 doesn't otherwise fit within the 1-6 group, to which the rogue should go according to the bootstrap analysis, they will both be placed apart, but with low BS support. Note that all ML implementations treat missing data as N (= A, C, G or T), i.e. the terminal probability vector is p(1,1,1,1).

When you have four taxa like this:

taxon1 A G A G
taxon2 G A A A
taxon3 N N T A
taxon4 N N C G

the most likely tree may be (1+2) (3+4), even if the true tree is (1+4) (2+3).
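To make the p(1,1,1,1) point concrete, here is a minimal Python sketch (not RAxML's actual code) that scores the columns of the toy alignment above on a four-taxon tree. The assumptions are mine: a JC69 model, all five branches fixed at the same length t = 0.1, and function names made up for illustration. RAxML optimises branch lengths and model parameters, so its numbers will differ.

```python
import math
from itertools import product

STATES = "ACGT"

def tip_vector(base):
    """Tip likelihoods: 1.0 for every state compatible with the observed base."""
    if base == "N":  # missing data: compatible with A, C, G and T
        return [1.0, 1.0, 1.0, 1.0]
    return [1.0 if s == base else 0.0 for s in STATES]

def jc69(t):
    """JC69 transition probabilities for branch length t (assumed model)."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def site_likelihood(a, b, c, d, t=0.1):
    """Likelihood of one column on the quartet (a,b)|(c,d), all branches of length t."""
    P = jc69(t)
    va, vb, vc, vd = (tip_vector(x) for x in (a, b, c, d))

    def up(v, x):
        # probability of the tip data at the end of a branch, given state x at its start
        return sum(P[x][s] * v[s] for s in range(4))

    total = 0.0
    for x, y in product(range(4), repeat=2):  # states at the two internal nodes
        total += 0.25 * up(va, x) * up(vb, x) * P[x][y] * up(vc, y) * up(vd, y)
    return total

alignment = {"taxon1": "AGAG", "taxon2": "GAAA", "taxon3": "NNTA", "taxon4": "NNCG"}

def log_likelihood(pairing, t=0.1):
    """Sum of per-column log-likelihoods for one of the three quartet topologies."""
    (a, b), (c, d) = pairing
    n_sites = len(alignment[a])
    return sum(
        math.log(site_likelihood(alignment[a][i], alignment[b][i],
                                 alignment[c][i], alignment[d][i], t))
        for i in range(n_sites)
    )

for pairing in ((("taxon1", "taxon2"), ("taxon3", "taxon4")),
                (("taxon1", "taxon3"), ("taxon2", "taxon4")),
                (("taxon1", "taxon4"), ("taxon2", "taxon3"))):
    print(pairing, round(log_likelihood(pairing), 4))
```

In the first two columns the N tips drop out of the sum entirely, so those columns only inform the distance between taxon1 and taxon2 and say nothing about where taxon3 and taxon4 attach.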

The quick fix is to remove taxon 7 with the poor data coverage (and other taxa with poor data coverage). If you want to know where it belongs (they belong), use the EPA (evolutionary placement algorithm) to optimise its (their) position in a tree inferred without it (them).
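If you want to screen for poorly covered taxa before deciding what to remove, something like the following Python sketch would do. It assumes the alignment is already loaded as a dict of taxon name to sequence; the function names and the 0.5 cutoff are placeholders of mine, not a recommended threshold.

```python
MISSING = set("N?-")  # characters counted as missing data (assumption)

def coverage(seq):
    """Fraction of alignment columns with a non-missing character for one taxon."""
    if not seq:
        return 0.0
    return sum(ch.upper() not in MISSING for ch in seq) / len(seq)

def split_by_coverage(alignment, min_coverage=0.5):
    """Return (kept alignment, sorted names of taxa below the coverage cutoff)."""
    kept = {name: seq for name, seq in alignment.items()
            if coverage(seq) >= min_coverage}
    dropped = sorted(set(alignment) - set(kept))
    return kept, dropped

# Toy data, not from the original matrix
toy = {"t1": "ACGTACGT", "t2": "ACGTNNNN", "t3": "NN--??NN"}
kept, dropped = split_by_coverage(toy, min_coverage=0.5)
print("kept:", sorted(kept), "dropped:", dropped)
```

The dropped taxa are the candidates for placement with the EPA afterwards.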

The comprehensive investigation would be to run a full single-gene analysis of your DNA matrix to see which genes support the overall topology and whether some are in conflict with it. You can summarize the tree sample using the increasingly fashionable cloudograms (a nice example from the popular press: https://www.theguardian.com/science/punctuated-equilibrium/2011/nov/02/hawaiian-honeycreepers-tangled-evolutionary-tree) or the long-known but underused supernetworks/consensus networks (see e.g. http://dx.doi.org/10.1111/2041-210X.12760, open access).

When you opt for the latter (exploring gene support for your overall tree), this is what you could do.

You infer single-gene trees and establish bootstrap support for every gene partition in your data.
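As a starting point for the single-gene analyses, here is a Python sketch that cuts a concatenated matrix into per-gene alignments. It assumes the matrix is a dict of taxon name to sequence and that gene boundaries are given as 1-based inclusive column ranges (the convention used in RAxML partition files); the function and gene names are hypothetical.

```python
MISSING = set("N?-")  # characters counted as missing data (assumption)

def split_by_gene(alignment, partitions):
    """Cut a concatenated alignment into one sub-alignment per gene.

    partitions maps gene name -> (start, end), 1-based and inclusive.
    Taxa with nothing but missing data in a gene are left out of that gene.
    """
    genes = {}
    for gene, (start, end) in partitions.items():
        sub = {}
        for name, seq in alignment.items():
            piece = seq[start - 1:end]
            if any(ch.upper() not in MISSING for ch in piece):
                sub[name] = piece
        genes[gene] = sub
    return genes

# Toy data, not from the original matrix
aln = {"t1": "ACGTAC", "t2": "ACGNNN", "t3": "NNNTAC"}
per_gene = split_by_gene(aln, {"geneA": (1, 3), "geneB": (4, 6)})
for gene, sub in per_gene.items():
    print(gene, sorted(sub))
```

Each sub-alignment can then be written out and run through its own tree search and bootstrap.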

Some of them may be star-like or unresolved due to a lack of discriminating signal. You may notice additional rogues: taxa that do not cover all gene regions and jump around in the gene trees (in the worst case with high support). When your data are from the mitochondrion or chloroplast, this is usually due to resolution issues in the individual genes. For instance, a taxon covered only for highly conserved gene regions is hard to place: there will be several equally probable but incongruent solutions.

By contrast, a taxon covered for a few more conserved but informative gene regions and one highly variable region that becomes fuzzy towards the root or tips of your tree (inviting branch-attraction artefacts) will just follow that variable region's affinity, in the worst case outcompeting a more sensible signal in the more conserved regions.

To compare the topologies gene-wise in a quick way, you are best advised to use trees with identical tip sets. If there are no signal issues beyond missing data, the gene trees should not conflict with the overall (combined) tree. Equal-tip trees are also needed when you want to summarize them using consensus networks (not sure about cloudograms; I have never used them myself, but I guess they should work with different tip sets since it's just a visualisation).
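One way to get equal tip sets is to restrict every gene alignment to the taxa present in all of them before inferring the gene trees (pruning already inferred trees with a tree library would be the alternative). A minimal Python sketch, assuming each gene alignment is a dict of taxon name to sequence; the function names are made up for illustration.

```python
def shared_taxa(gene_alignments):
    """Taxa that are present in every gene alignment."""
    taxon_sets = [set(aln) for aln in gene_alignments.values()]
    return set.intersection(*taxon_sets) if taxon_sets else set()

def restrict(gene_alignments, taxa):
    """Keep only the given taxa in each gene alignment."""
    return {gene: {name: seq for name, seq in aln.items() if name in taxa}
            for gene, aln in gene_alignments.items()}

# Toy data, not from the original matrix
genes = {
    "geneA": {"t1": "ACG", "t2": "ACG"},
    "geneB": {"t1": "TAC", "t2": "TAA", "t3": "TAC"},
}
common = shared_taxa(genes)
equal_tip = restrict(genes, common)
print(sorted(common), {gene: sorted(aln) for gene, aln in equal_tip.items()})
```

With many genes and patchy sampling the shared set can shrink quickly, so it may be worth dropping the worst-covered genes first.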