Computing placement distances

171 views
Skip to first unread message

ear...@gmail.com

unread,
Aug 21, 2017, 6:52:05 AM8/21/17
to raxml
Hi everyone,

Berger & Stamatakis (2010) describe the "node distance" and the "branch distance" to compare alternative positions of a single leaf in trees of otherwise identical topologies. Is it possible to compute these distances with RAxML between pairs of user-supplied trees? (similar to the `-f r` option for the Robinson-Foulds distance).

If I understand correctly, the `-f S` option uses the same node distance to quantify the site placement bias, so RAxML can compute that distance, but it seems there is no [documented] command to do it with user-supplied trees.

Best,

Eduardo

Lucas Czech

unread,
Aug 21, 2017, 8:48:32 AM8/21/17
to ra...@googlegroups.com

Hi Eduardo,

you are referring to the paper "Accuracy of Morphology-based Phylogenetic Fossil Placement under Maximum Likelihood".

If I understand you correctly, your input are two trees (in Newick format?) that are exactly identical except for the position of one taxon, right? Are they also identically rooted and rotated (i.e., both ladderized)? Or are you by any chance working with phylogenetic placement of sequences, i.e., pplacer or RAxML-EPA, and have Jplace files as input? That makes it easier to find the one differing taxon (which would be a placement then, instead of an actual branch on the tree). This also makes it possible to run the whole thing for many different single taxa (each of them placed on the same constance reference tree).

I don't know whether RAxML has an option for this - Alexis can answer this question better. But could you maybe post two exemplary files? I have an idea how to easily implement that. Don't know when I get to do this, but sounds like a fun thing to implement.

So long
Lucas

--
You received this message because you are subscribed to the Google Groups "raxml" group.
To unsubscribe from this group and stop receiving emails from it, send an email to raxml+un...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

ear...@gmail.com

unread,
Aug 21, 2017, 9:50:24 AM8/21/17
to raxml
Hi Lucas,

Thanks for your answer. I quote-reply for clarity.


you are referring to the paper "Accuracy of Morphology-based Phylogenetic Fossil Placement under Maximum Likelihood".

Yes, exactly.


If I understand you correctly, your input are two trees (in Newick format?) that are exactly identical except for the position of one taxon, right?

Yes.


Are they also identically rooted and rotated (i.e., both ladderized)?

I can make the trees be identically rooted and rotated if this is convenient or necessary.


Or are you by any chance working with phylogenetic placement of sequences, i.e., pplacer or RAxML-EPA, and have Jplace files as input?

I am thinking of using RAxML-EPA (I have a morphology partition, which I think pplacer cannot handle), but I'd also use other  placement methods with other software. The common output for all the programs would be newick strings, so the ideal would be to have a general solution that takes trees directly supplied by the user.


But could you maybe post two exemplary files? I have an idea how to easily implement that. Don't know when I get to do this, but sounds like a fun thing to implement.

Sure, I am attaching two small trees and R code to generate them at the end of this message. The trees differ in the position of "t5". If I understand correctly the way nodes are supposed to be counted, the nodal distance between those trees would be 5.

It would be great if you could implement a solution for this problem, if RAxML doesn't have an option to do it. But please don't feel pressured! :)

Cheers,

Eduardo

```
library(phytools)
set.seed(12)
tr1 <- rtree(10)
tr1 <- root(tr1, "t1", resolve.root = TRUE)
tr1 <- ladderize(tr1)
plot(tr1)

# In the random generated tree tr1, "t5" is sister to "t10".
# I will prune "t5" and graft it in a position sister to "t2".

tr2 <- drop.tip(tr1, "t5")

# Find out the node number of "t2" in the pruned tree tr2
tr2$tip.label # The node number of "t2" is 1

tr2 <- bind.tip(tr2, "t5", where = 1, position = 0.5, edge.length = 0.1) # position and edge.length are arbitrary

# Visualise both trees against each other
cophyloplot(tr1, tr2)

# Save the trees in Newick format
write.tree(tr1, "tree1.tre")
write.tree(tr2, "tree2.tre")
```
tree1.tre
tree2.tre

Alexandros Stamatakis

unread,
Aug 21, 2017, 11:34:26 PM8/21/17
to ra...@googlegroups.com
Dear Eduardo,

> Berger & Stamatakis (2010) describe the "node distance" and the "branch
> distance" to compare alternative positions of a single leaf in trees of
> otherwise identical topologies. Is it possible to compute these
> distances with RAxML between pairs of user-supplied trees? (similar to
> the `-f r` option for the Robinson-Foulds distance).

Unfortunately not, but the RF distance will give you a reasonable result
for comparing these, but essentially this will give you something that
is equivalent to the node distance.

> If I understand correctly, the `-f S` option uses the same node distance
> to quantify the site placement bias, so RAxML can compute that distance,
> but it seems there is no [documented] command to do it with
> user-supplied trees.

Yes, but implementing this for pairs of trees is not that
straight-forward. As we are currently re-designing RAxML this should be
better handled via some post-analysis tool. I think Lucas can help you
here since he is working on such a tool and methods for post-analyzing
placements.

Alexis

>
> Best,
>
> Eduardo
>
> --
> You received this message because you are subscribed to the Google
> Groups "raxml" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to raxml+un...@googlegroups.com
> <mailto:raxml+un...@googlegroups.com>.
> For more options, visit https://groups.google.com/d/optout.

--
Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology

www.exelixis-lab.org

ear...@gmail.com

unread,
Aug 22, 2017, 5:22:05 AM8/22/17
to raxml
Thanks, Alexis. The node distance seems the most natural measure for this kind of comparison, but I intuitively see how RF is similar.

Since 2*(n-3) is the maximum RF distance in the space of all possible topologies, how would you normalise RF for the single species placement problem, where the topological space is more restricted? Something I did before was to generate the trees with all the possible resolved placements of the query species to find the maximum possible RF distance. That is a bit cumbersome, maybe there is a better way I'm ignoring.

Cheers,

Eduardo

Alexandros Stamatakis

unread,
Aug 23, 2017, 12:39:21 AM8/23/17
to ra...@googlegroups.com
Dear Eduardo,

> Thanks, Alexis. The node distance seems the most natural measure for > this kind of comparison, but I intuitively see how RF is similar.
>
> Since 2*(n-3) is the maximum RF distance in the space of all possible
> topologies, how would you normalise RF for the single species placement
> problem, where the topological space is more restricted?

Without having properly thought about this, if the node distance of the
placement is 1, there will be two bipartitions that will become unique
to either tree, hence the raw RF distance should be divided by two.

Analogously, the further the placement is away the more bipartitions
should become unique along the path from one placement position to the
other. I am not 100% sure what I am saying is correct, but that's the
intuition I would follow. One could then probably prove this via induction.

> Something I did
> before was to generate the trees with all the possible resolved
> placements of the query species to find the maximum possible RF
> distance. That is a bit cumbersome, maybe there is a better way I'm
> ignoring.

This should just be the longest path between two terminal branches in
the tree.

Alexis
> > an email to raxml+un...@googlegroups.com <javascript:>
> > <mailto:raxml+un...@googlegroups.com <javascript:>>.
> > For more options, visit https://groups.google.com/d/optout
> <https://groups.google.com/d/optout>.
>
> --
> Alexandros (Alexis) Stamatakis
>
> Research Group Leader, Heidelberg Institute for Theoretical Studies
> Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology
>
> www.exelixis-lab.org <http://www.exelixis-lab.org>

Lucas Czech

unread,
Aug 23, 2017, 9:52:27 AM8/23/17
to ra...@googlegroups.com, Alexandros Stamatakis
Hi Eduardo,

so, our group pondered over this problem (because it is interesting and
fun), and ran into some issues.

If the taxon is known, calculating the distances is trivial. So we tried
to find an algorithm which finds the moved taxon, given the two trees.
This generally however does not seem to have a unique solution. Thus, we
were wondering what to do in those cases. The resulting edge distance
can differ for the different solutions.

The easiest case would be to give the moved taxon name as a parameter to
the program for calculating the distance. But I am not sure whether you
have that information for all your pairs of trees. What do you want to
do in those cases?

Lucas

ear...@gmail.com

unread,
Aug 23, 2017, 10:22:23 AM8/23/17
to raxml, alexandros...@gmail.com
Hi Lucas,

It is perfectly fine for me to give the query species name as a parameter.

Eduardo

ear...@gmail.com

unread,
Aug 23, 2017, 10:24:19 AM8/23/17
to raxml
Thanks Alexis!

Lucas Czech

unread,
Aug 23, 2017, 11:46:08 AM8/23/17
to ra...@googlegroups.com

Okay, that makes it way easier ;-)

Next question: I had a look at your example trees. They seem to have exactly the same branch lengths, and even the parts that get divided by the moving taxon in one tree add up to the corresponding branch length on the other tree. So, in this example, calculating the branch distance is doable, because there are no inconsistencies (except for rounding of the numbers). But can we generally assume that this is the case? What if not?

Also, this leads me to some meta-questions: How did you obtain those trees? If they result from different bootstrap runs or the like, I'd expect different branch lengths. Also, why is it that only one taxon moves around? Maybe understanding this helps me in understanding how to solve those implementation questions.

Thanks, so long
Lucas

To unsubscribe from this group and stop receiving emails from it, send an email to raxml+un...@googlegroups.com.

ear...@gmail.com

unread,
Aug 24, 2017, 5:40:20 AM8/24/17
to raxml
Hi Lucas,

The reason why a single species is moving around is because I want to apply the leave-one-out test like the ones from Berger & Stamatakis (2010) and Berger & al. (2011), where every species in a fixed reference tree is pruned out and then re-inserted, and distance between the original position of the query species and its reinsertion point is a measure of the topological error of the phylogenetic placement method being used.

My ultimate goal is to insert a fossil species in a molecular tree. Because the morphological data seems noisy, I want to try out different character weighting methods and use a leave-one-out test to compare their performance. The weighting methods would include the one from Berger & Stamatakis (2010), and a few parsimony-based ones.

I am still working on the character coding in the morphological matrix, so I haven't produced empirical trees at this point. The trees that I sent you were just artificial examples with dummy branch lengths. I think I would use EPA or simple RAxML searches with a skeletal constraint, so whatever changes in branch lengths those methods induce I would expect to be accounted for in the branch length placement distance. However, because the branch length placement distance doesn't make much sense for comparing results between ML and parsimony placements, this distance is not that important for my specific purposes.

Am I making sense with all that? If you want, I can try to produce a few empirical trees later today.

Cheers,

Eduardo

Lucas Czech

unread,
Aug 24, 2017, 12:17:41 PM8/24/17
to ra...@googlegroups.com

Hi Eduardo,

thanks for the detailed answer!


The reason why a single species is moving around is because I want to apply the leave-one-out test like the ones from Berger & Stamatakis (2010) and Berger & al. (2011), where every species in a fixed reference tree is pruned out and then re-inserted, and distance between the original position of the query species and its reinsertion point is a measure of the topological error of the phylogenetic placement method being used.
Okay, that sounds reasonable. So this would give you two trees to compare, the original one (always the same) and one after re-inserting each taxon, so that, for each taxon, you can calculate the distance from where it ended up in the re-insertion-tree to where it originally was. Correct?


My ultimate goal is to insert a fossil species in a molecular tree. Because the morphological data seems noisy, I want to try out different character weighting methods and use a leave-one-out test to compare their performance. The weighting methods would include the one from Berger & Stamatakis (2010), and a few parsimony-based ones.
So you want to use a similar alignment as shown on the first page of Berger & Stamatakis "Accuracy of Morphology-based Phylogenetic Fossil Placement under Maximum Likelihood", where the first columns correspond to morphological characters, and the rest columns correspond to their molecular data? Thus, the molecular columns would be empty for fossils?


I am still working on the character coding in the morphological matrix, so I haven't produced empirical trees at this point. The trees that I sent you were just artificial examples with dummy branch lengths. I think I would use EPA or simple RAxML searches with a skeletal constraint, so whatever changes in branch lengths those methods induce I would expect to be accounted for in the branch length placement distance. However, because the branch length placement distance doesn't make much sense for comparing results between ML and parsimony placements, this distance is not that important for my specific purposes.
Hm all right. If you are using EPA for placing your fossils, the calculation of distances becomes a bit easier, as the tree is constant. That is, instead of having to find the "moved" taxon on two different trees with (potentially) different branch lengths, we have a fixed tree.

If I understand you correctly, the node and edge distance that you want to calculate here are calculated between the placement positions of your fossil when placing it with ML and with parsimony, respectively. Right? So, if my previous assumption about the shape of the alignment is right, you are only using the morphological characters for this, right? This might really mess up the branch lengths... Maybe Alexis can tell you a bit more about this.

Now I drowned you in even more questions, sorry for that!
Lucas

Grimm

unread,
Aug 25, 2017, 8:25:00 AM8/25/17
to raxml
Hi Eduardo,

Aside that it may be an computational-interesting problem, are you sure you need these distances/comparisons at all?


My ultimate goal is to insert a fossil species in a molecular tree. Because the morphological data seems noisy, I want to try out different character weighting methods and use a leave-one-out test to compare their performance. The weighting methods would include the one from Berger & Stamatakis (2010), and a few parsimony-based ones.

With "noisy" you probably mean your morphological data are not tree-like. Meaning: you have a limited number of character patterns that are compatible with the true tree, or the molecular tree in your case, whereas most are not or at least partly conflicting with the branches in the tree. This is what EPA deals with in a probabilistic framework using the pre-defined tree as a guide. It (can) weigh(s) the characters of the morphological partition to find a weighting scheme that - ideally - would allow to infer the input tree based on the morphological data partition, and then places the query at the most likely position.

I have not updated myself on parsimony weighting procedure for quite some time, but this is also what parsimony weighting methods were designed for: you have a tree or assume that there is a single tree that explains the (morphological) diversity patterns in your matrix, and then you downweigh all characters that are incompatible with that tree. So, in your case, the best you can get with a parsimony weighting (from a theoretical point of view) is something similar to the EPA result. So what do you hope to learn from the placement distances?
Finding the best parsimony weighting scheme, the one with the lowest distance to the EPA result? But if this is the logic, why using parsimony weighting at all?!
And if the logic is that EPA is not the most optimal solution for this particular problem, what will it tell you that parsimony weighting method A shows a higher placement distance to EPA than method B? Are you planning to make simulations with morphological partitions increasingly incompatible with the true or molecular tree to test which method performs best?

The only goal I see with such an experiment is to classify parsimony weighting methods and estimate their insufficiency to approach the EPA result. But you just want to place your fossils, right?

Frankly, I would not worry about applying any parsimony weighting scheme. If you have a good molecular guide tree and want to place your fossils individually in that guide tree, you just use the EPA algorithm. If the reviewers mourn, you just do a quick parsimony-optimisation to please them (can also be done with RAxML) and then discuss, why in some cases the most-parsimonious placement is not the most likely one (because it didn't cope with data patterns incompatible with the molecular tree).
And if you want to study the signal in your morphological partition with respect to the fossils. Well, since the data are non-treelike, you anyway should not rely on tree inference but do network-based exploratory data analysis, so also no point in parsimony-weighting here.
For total evidence, you just used the EPA-generated weights for weighting the morphological partition.

Cheers, Guido

PS The whole idea behind weighting in parsimony was originally to filter data patterns incompatible with the inferred tree(s) to increase the resolution of the strict consensus trees of the MPT sample and increase branch support. But you can easily simulate data to see that this is a snake biting in its own tail. Worse case, you downweigh characters compatible with the true tree, when they are in the minority (see also the recent paper of Scotland, R.W., & Steel, M. (2015) Circumstances in which parsimony but not compatibility will be provably misleading. Systematic Biology, 64, 492–504). That is probably the reason why parsimony weighting is today only found in studies dealing with entirely extinct groups where the lack of molecular data makes it impossible to do total evidence, DNA-scaffolding, etc. (i.e. trimming the morphological data to the molecular tree); and also to cross-check the inferred phylogeny by other data.
Another was to compensate for the inferiority of parsimony-inference in respect to probabilistic approaches when it came to molecular data (see e.g. according chapters in Felsenstein, Inferring Phylogenies, 2004). Essentially, one tried to find character weighting schemes that could somehow emulate the substitution models to infer trees not so different from the ML or BI results. Overall, personal opinion, the entire exercise was utterly pointless. Its interesting from a theoretical point of view, but has little applicability these days. Parsimony works when the rate of change is low and you can do median networks or related things. When this is not the case (and noisyness is an indication that your morphological traits can quite easily change), there's no point to apply it.






 

ear...@gmail.com

unread,
Aug 25, 2017, 8:54:12 AM8/25/17
to raxml
Hi, Lucas

Okay, that sounds reasonable. So this would give you two trees to compare, the original one (always the same) and one after re-inserting each taxon, so that, for each taxon, you can calculate the distance from where it ended up in the re-insertion-tree to where it originally was. Correct?
Yes, exactly.

So you want to use a similar alignment as shown on the first page of Berger & Stamatakis "Accuracy of Morphology-based Phylogenetic Fossil Placement under Maximum Likelihood", where the first columns correspond to morphological characters, and the rest columns correspond to their molecular data? Thus, the molecular columns would be empty for fossils?

Yes.

If I understand you correctly, the node and edge distance that you want to calculate here are calculated between the placement positions of your fossil when placing it with ML and with parsimony, respectively. Right?

No, I'm only really interested in using the placement distances in the leave-one-out test, to quantify the placement error (e.g. average distance between position in reference tree and re-insertion point) of each weighting method. Since I am comparing ML and parsimony methods, the node placement distance is more relevant. That distance alone should suffice for my purposes, I think.

So, if my previous assumption about the shape of the alignment is right, you are only using the morphological characters for this, right?

Yes.

This might really mess up the branch lengths... Maybe Alexis can tell you a bit more about this.

I hadn't considered this. I will re-read Berger & Stamatakis (2010) and Beger et al. (2011) to see how they dealt with this. I would guess that they only keep the topology of the molecular tree, and re-optimise the branch lengths with the morphological partition only. Would that make sense?

Cheers,

Eduardo

ear...@gmail.com

unread,
Aug 25, 2017, 9:32:48 AM8/25/17
to raxml
Hi Guido,

Thanks for your thoughtful comments!

So, in your case, the best you can get with a parsimony weighting (from a theoretical point of view) is something similar to the EPA result. So what do you hope to learn from the placement distances?
Finding the best parsimony weighting scheme, the one with the lowest distance to the EPA result? But if this is the logic, why using parsimony weighting at all?!

No, I don't want to measure the distance of the parsimony results to the EPA resutls. I want to use the distances to perform a leave-one-out test described in Berger & Stamatakis (2010). The test takes my "known" reference tree and prunes out and then re-inserts each species one by one. The idea is that the distance between the original position and the re-insertion position of the species in the reference tree will give an estimation of the potential accuracy of the method when trying to place fossils. I would perform the test using each of the weighting methods.

Are you planning to make simulations with morphological partitions increasingly incompatible with the true or molecular tree to test which method performs best?

No, I will assess the accuracy of the methods just for my data set, under the assumption that the molecular tree is correct. It would be nice to assess overall performance of these methods with simulations, but I don't feel ready to tackle the problem of simulating realistic morphological data yet.


Frankly, I would not worry about applying any parsimony weighting scheme. If you have a good molecular guide tree and want to place your fossils individually in that guide tree, you just use the EPA algorithm. If the reviewers mourn, you just do a quick parsimony-optimisation to please them (can also be done with RAxML) and then discuss, why in some cases the most-parsimonious placement is not the most likely one (because it didn't cope with data patterns incompatible with the molecular tree).
And if you want to study the signal in your morphological partition with respect to the fossils. Well, since the data are non-treelike, you anyway should not rely on tree inference but do network-based exploratory data analysis, so also no point in parsimony-weighting here.
For total evidence, you just used the EPA-generated weights for weighting the morphological partition.

The main reason for trying parsimony as well is that RAxML has some limitations for the analysis of morphological data. It does not accept cells with polymorphisms (of which I have plenty in multistate charcters), and it cannot work with combinations of ordered and unordered multistate characters. How much useful signal I would be really losing with those restrictions, I don't know, which is why I want to try the different methods.

I am not used to working with phylogenetic networks, but I am aware of your papers on the topic, and I will try to learn more about it.

Cheers,

Eduardo

Lucas Czech

unread,
Aug 25, 2017, 9:38:35 AM8/25/17
to ra...@googlegroups.com

Hi Eduardo,

all right, I'm starting to understand ;-) Also, Guido's answer shed some light on this issue, thanks for that!

Okay, so you want to measure the distance of each re-inserted taxon to its original position, and do that once with ML and once with parsimony, and then compare. Did I get it right this time?

As for the branch lengths: We briefly discussed this in our group. From an algorithmic point of view, it makes sense to re-optimize the branch length with morphological data, while keeping the molecular topology. However, depending on the number of columns/traits that you have, this might give unstable/unreliable numbers. How long is your morphological alignment?

Lastly, using this tree (molecular topology, morphological branch length) with EPA would be easiest, I think (as compared to using RAxML with the tree as constraint). Also, calculating the needed distances from the resulting placement files is way easier for me to implement than the two-trees version.

So long
Lucas

ear...@gmail.com

unread,
Aug 25, 2017, 10:27:37 AM8/25/17
to raxml
Hi Lucas,
 
Okay, so you want to measure the distance of each re-inserted taxon to its original position, and do that once with ML and once with parsimony, and then compare. Did I get it right this time?

Yes, that's it.

As for the branch lengths: We briefly discussed this in our group. From an algorithmic point of view, it makes sense to re-optimize the branch length with morphological data, while keeping the molecular topology. However, depending on the number of columns/traits that you have, this might give unstable/unreliable numbers. How long is your morphological alignment?

Fairly short, I'm afraid. I estimate that the final matrix will have between 72 to 85 characters, and some of those may end up with weights of 0.

Also, calculating the needed distances from the resulting placement files is way easier for me to implement than the two-trees version. 

That would only work for EPA with RAxML, not with parsimony, right?

Cheers,

Eduardo

Lucas Czech

unread,
Aug 25, 2017, 10:39:17 AM8/25/17
to ra...@googlegroups.com

Hi,

Fairly short, I'm afraid. I estimate that the final matrix will have between 72 to 85 characters, and some of those may end up with weights of 0.
Hm okay. That might still work reasonably well if your tree is not too big (more than a few hundred taxa maybe).


That would only work for EPA with RAxML, not with parsimony, right?
RAxML also has an option to run EPA with parsimony (-f y). And for the normal ML EPA (-f v), it should be possible to specify a binary model for the morphological characters (-m). I guess that EPA is able to use this model. Maybe Alexis can answer this. Also, see the RAxML Manual for details.

Best
Lucas

Grimm

unread,
Aug 25, 2017, 11:00:57 AM8/25/17
to ra...@googlegroups.com

Hi

That would only work for EPA with RAxML, not with parsimony, right?
RAxML also has an option to run EPA with parsimony (-f y). And for the normal ML EPA (-f v), it should be possible to specify a binary model for the morphological characters (-m). I guess that EPA is able to use this model. Maybe Alexis can answer this. Also, see the RAxML Manual for details.
If you can read out the parsimony weighting schemes, you can implement them for the EPA framework in RAxML, too.
You just read-in the weighting table with -a add this option to the -f v call-line for invoking the EPA; maybe also works with -f y. If not, ML optimisation using the MK model is not so different from MP, the main differences will be due to the weighting scheme used.

Here's the codelines from the batch, I used for our Osmundaceae paper (for RAxML 7). Back then, one had to first run -f u to do the weighting, and then -f v using the u-output for the final EPA.

raxmlHPC -s Osmceae.Morph.epf -m MULTIGAMMA -n Osmceae.WC.GTR -K GTR -f u -# 1000 -t RAxML_besttree.Osmceae.MolesRed
raxmlHPC -s Osmceae.Morph.epf -m MULTIGAMMA -n Osmceae.WC.MK -K MK -f u -# 1000 -t RAxML_besttree.Osmceae.MolesRed
raxmlHPC -s Osmceae.Morph.epf -m MULTIGAMMA -n Osmceae.WC.MP -f U -# 1000 -t RAxML_besttree.Osmceae.MolesRed

raxmlHPC -s Osmceae.Morph.epf -m MULTIGAMMA -n Osmceae.Plot.GTR -f v -# 1000 -t RAxML_besttree.Osmceae.MolesRed -a RAxML_weights.Osmceae.Morphplot.GTR

For standard EPA that changed in RAxML8, as the weight calibration is also part of -f v (I think). But the -u function is still there in RAxML 8, so I guess -f v -a will be still running, too.

Cheers, G.

Grimm

unread,
Aug 25, 2017, 11:40:08 AM8/25/17
to raxml
Hi Eduardo,


The main reason for trying parsimony as well is that RAxML has some limitations for the analysis of morphological data. It does not accept cells with polymorphisms (of which I have plenty in multistate charcters), and it cannot work with combinations of ordered and unordered multistate characters. How much useful signal I would be really losing with those restrictions, I don't know, which is why I want to try the different methods.

I see. But this is no problem. You just recode the matrix to trick RAxML and emulate ordering and polymorphism using binarisation.
I did this some time ago, also to test if the replacing of the ambiguous characters by "?" needed for RAxML (I remember darkly that Alexi promised me long ago an implementation for MULTIGAMMA that can handle polymorphisms...).
If you have
TaxonA 0
TaxonB 1
TaxonC 2
TaxonD 1/2
you just binarise the character
TaxonA 100
TaxonB 010
TaxonC 001
TaxonD 011
And then weight the three characters equal to a normal single without ambiguity/polymorphism, using the -a option (or just run the analysis, the result will show if that would be overweighting them). You can also directly code all your traits in this manner using a number of characters, which you tick for "is 0", "is 1", "is 2" etc
Effectively, you can also just treat the polymorphism as a new state, but if you have many (and I suppose not only in the fossils but even more in the extant taxa you use for the molecular tree backbone), I would put the effort in binarising.

Second, regarding the combinations or unordered and ordered. The latter can be coded in a similar way, let's say we have an ordered character with three states
TaxonA 0
TaxonB 1
TaxonC 2
it becomes
TaxonA 00
TaxonB 01
TaxonC 11
To assure same treatment than with parsimony, no character weigthing here, as it should be 1 step from 0->1->2

Because the parsimony weightings are posterior processes, I would rather use customised matrices as above and compare the placements between the matrices (standard RAxML input matrix, i.e. no weighting, all unordered, polymorphism treated as missing vs. unweighted and polymorphism binarised vs. weighted, polymorphism binarised vs. ordered and polymorphism binarised.

Another thing is also to check the placement based on character subsets. You probably have different organs etc. in you morphological partition, which can be a source of internal conflict. Placing fossils is not tricky because of the conflicting data, but because each character has essentially a not-known evolutionary property (that is why no-one ever simulated a binary or other matrix matching real-world morpholgocial matrices). The ideal weighting would be a prior weighting (but we usually have no basis to come up with something like this)

You are free to go down the line of parsimony weighting, but I would stick with one method and address the issue of ordered, unordered, and polymorphic characters via the primary matrix coding. I stopped recoding because for the data I work at, it made largely no difference whether you treat a character as ordered or not replace polymorphism by "?", but if you have many polymorphisms, coding them as such in the binary fashion may of course increase the EPA identification potential.

And maybe RAxML 9 will have the polymorphism-treatment for Multistate data implemented and allow for mixed ordered-multistate models.

 

I am not used to working with phylogenetic networks, but I am aware of your papers on the topic, and I will try to learn more about it.


Their major strength are that they are not restricted to a tree. Another is, you can easily combine ordered and unordered characters, and continuous and categorical using according distance calculations.
They are also very good for just checking-out the data and very easy and quick to generate: you take your matrix, you calculate simple or sophisticated distances that take into account ordered, unordered, and you run the neighbour-net. If it looks like this


stay with the EPA to place your fossils.
When it looks like this

there may be some hope for morph-based phylogenetics :)

Cheers, Guido

PS Same data set, but in the second all taxa with less than 50% missing data were eliminated, they will be featured in my next blogpost on trees for non-treelike data.

ear...@gmail.com

unread,
Aug 26, 2017, 5:23:54 PM8/26/17
to raxml
Hi Guido,

It didn't occur to me to recode the matrix into binary characters! That would indeed be the solution to the missing and mixed character types problems, and would allow me to do everything with RAxML. However, wouldn't the 'subcharacters' created with the recoding violate the assumption of site independence? Do you know of papers discussing this in the context of statistical phylogenetics?

Thanks for the tips on phylogenetic networks. Where will you be blogging?

Cheers,

Eduardo

Grimm

unread,
Aug 27, 2017, 9:50:08 AM8/27/17
to raxml


Am Samstag, 26. August 2017 23:23:54 UTC+2 schrieb ear...@gmail.com:
Hi Guido,

It didn't occur to me to recode the matrix into binary characters! That would indeed be the solution to the missing and mixed character types problems, and would allow me to do everything with RAxML. However, wouldn't the 'subcharacters' created with the recoding violate the assumption of site independence?

It would. But there are only three circumstances, in which we can explicitly deal with site dependence. rRNA and well-known proteines where we have the site-depency information because of the secondary and tertiary structure of the final molecules, and traits where we know the genetics (genes) that trigger them.

But for the morphological matrices we usually work with (particularly when including fossils) we have no idea about the trait genetics, and worse, epigenetic effects on them (not to mention that we may misinterpret fossil traits; see this post for the sources of tree-incompatible signal). Let me give you a theoretical example, why we should not be bothered about dependence as we cannot test for it.

Knowing the genetics and considering the original constraint on phylogenetic analysis to only use neutral genes and being aware that "no colour" is the ancestral state and modification of the original gene B changes colour either to blue or red, our matrix would have one binary character and one step-matrix character (in parsimony), which equals two (dependent) binary characters.
Form: 0 – circle, 1 – square/star (star being an adaption of squares, so not neutral)
Colour: 00 – no colour, 10 – blue/green (green being an adaptation of blues), 01 – red.
If we include the third gene, which has two functions (e.g. triggered by slightly different environmental conditions or could be epigenetics), we end up with
Form: 00 – circle, 10 – square, 11 – star
Colour: 000 – no colour, 100 – blue, 110 – green, 001 – red.
The new second and third character are dependent on the first for form and colour – this is the situation depency discussion is about. But, overlooking the fact that of course we would like to differentiate between stars and squares as they are not identical, but green or blue stars represent a more derived form as part of an evolutionary sequence.

But just from the discernible characters we would intuitively go for 2 ternary characters or one ternary and one 4-state character when finding colourless fossils associated to the lineage as in the evolutionary scenario.
Form: 0 – circle, 1 – square, 2 – star
Colour: 0 – no colour, 1 – red, 2 – blue, 3 – green.
If we just have the four modern taxa, this coding would give us the true tree. Adding the fossils would force parsimony into a collapsed consensus tree (because trees cannot handle ancestors and descendants) and tree-decisive methods into a tree that may be quite arbitrary.
If we binarise this we end up with seven characters. This may over-prefer a wrong tree topology, but gives ML the chance to play out the stochasticity card via the Gamma distribution modelling unequal substitution rates for the characters. For EPA it may even be beneficial, as we have the molecular tree to tell the algorithm how to deal with (weigh) each of the binary characters and to avoid model-assumption-effects (in case there are any)

The scoring problem (a problem much more imminent than character depency) starts with the following. Some would argue but green is intermediate between blue and red and should be ordered. They also would point out that no colour is absence of a discernible trait, so should be coded as a gap (which is treated as missing data in parsimony, and all-states ("N") in ML and would have one unordered ternary and one ordered ternary characters (which in standard-parsimony again equals to binarised characters during inference; 0 -> 2 is two steps, so it doesn't matter during inference if you code it as 0 and 2 and ordered, or 00 and 11 and binary: both times 2-steps)
Colour: ?? – no colour, 00 – red, 01 – green, 11 – blue
So just by ordering based on our perception we deviate from the actual situation that green can only be evolved in blue lineages.
Others could argue that the colours need to be more than one character, because blue, green, and red colours need different components to be active (something is blue because the yellow and red light wavelengths are filtered: 100, green is a mix of reflected blue and yellow: 110, and red means all yellow and blue wavelengths are filtered: 001)

With real world data the only thing we usually have is the combinations today or in the past, but no idea what triggers them. So since I cannot know whether squares and stars are related and how green fits between red and blue, I (personally) would just code the most I can get out of the situation
Form1: 0 - roundish, 1 - pointy
Form2: 0 - few pointy edges, 1 - many pointy edges (this would be "?" for all circles as we can't tell if the round is more similar to few pointy edges: no pointy edge or many pointy edges: a multi-star with an extreme number tiny spikes finally approaching a circular form, i.e. infinitive number of spikes)
Colour1: 0 - no colour, 1 - coloured
Colour2: 0 - not blue, 1 - blue
Colour3: 0 - not green, 1 - green
Colour4: 0 - not red, 1 - red
And then see how this matches up with other characters, or – in case it's a group with modern representatives showing different combinations of colour and form, map each character on the molecular tree to see which one is in good agreement with the tree and estimate the probability. (also what EPA does for optimisation)




Do you know of papers discussing this in the context of statistical phylogenetics?

Farris wrote a lot of papers in the 80s and 90s also discussing site indepency (I think), but all probably in the context of parsimony. Then there are of course the many papers in Cladistics to fight ML and distance-methods and to argue for superiority of MP, but I never bothered reading or collecting them. Because of my genetics background (I never studied "phylogenetics") – too many cases in which expression levels define form, and not the absence or presence of a gene – I never was interested in the question. For me, it always was a useless theoretical discussion ignoring the fact that we have to work with characters coding traits where we have no idea about how they are controlled (as in the quite simple example above ending with hell-complex matrix)
So after my master thesis, I just used trial-and-error following the logic: the character coding will be inevitably biased (see example above) and fail the assumptions of the inference, so just focus on signal stability and the common-sense reasonability of the result. And regarding signal, its always better to have characters with few states (ideally binary), no matter if potentially dependent or not. Because it facilitates the interpretation of bootstrap support values, which ideally represent the fraction of characters supporting a split. And then just check in the matrix if those characters code for the same organ/trait complex or not. For the above example, I can easily test if the overcoding of colour as four states (with regard to the genetic situation, colour weights double as strong than form but should be equal) affects my analysis by just eliminating them prior to analysis. That the beauty of "over-scored" matrices: it's easy to eliminate characters, but it takes much more time to re-code them; and although I loved step-matrices and ordered characters the problem with them is that you put assumptions in the analysis that you then may have to down-scale again.
And if you have good ontogenetic data supporting a step-matrix or ordered character, it is logical to code this in a binary form. It's just a different code for the exactly the same thing during analysis.

In case a reviewer comes up with the depency argument, you can simply point to the differences (or lack of them) when using different scoring schemes. For instance, when your original matrix with multi-state characters all treated as unordered places fossil A at the same (or a close) place than using the fully binarised matrix without any priori weighting, the dependency question is obviously no issue within the probabilistic framework you use.
And if you find strong differences, you anyway have to go back to your data what's the issue or discuss it based on the placement optimisation results such as: placement of xx is affected by whether characters are treated as unordered or ordered; ... is affected by the binarisation of multi-state characters not compensated by downweighting the set of binaries; etc.


 
Thanks for the tips on phylogenetic networks. Where will you be blogging?

The best network blog site there is ;) http://phylonetworks.blogspot.com/

Cheers, Guido

Alexandros Stamatakis

unread,
Aug 27, 2017, 11:27:14 PM8/27/17
to ra...@googlegroups.com
>> That would only work for EPA with RAxML, not with parsimony, right?
> RAxML also has an option to run EPA with parsimony (-f y). And for the
> normal ML EPA (-f v), it should be possible to specify a binary model
> for the morphological characters (-m). I guess that EPA is able to use
> this model. Maybe Alexis can answer this. Also, see the RAxML Manual for
> details.

Yes, EPA can do this.

Alexis

ear...@gmail.com

unread,
Aug 28, 2017, 7:07:20 AM8/28/17
to raxml
Thanks, Guido. You give me plenty to ponder about. The ordering of my characters would be generally less controversial, I think, because they relate to meristic series and relative displacements of structures, although that may be more reflective of the common dispositions of palaeontologists than the state of the art on the genetic and developmental factors behind those traits. Hopefully the topic will become more discussed in the literature as the use of statistical methods on morphological data grows.

Anyway, in my case I have advantage of having a reference molecular tree to check the results against with the leave-one-out test, so I can just take the more pragmatic approach of comparing the performance of the different codings, as you suggest (albeit I can't compare all the possible schemes, because of the polymorphism and mixed ordering issues).

Cheers,

Eduardo

ear...@gmail.com

unread,
Aug 28, 2017, 7:22:41 AM8/28/17
to raxml
Thanks, Alexis. The thread is becoming rather long now, but at some point I mentioned that my interest in using parsimony was to be able to use other parsimony programs in order circumvent the limitations in RAxML with the handling of multistate polymorphic characters and mixed ordering. Guido proposed recoding the matrix into binary characters. What would be your approach?

Cheers,

Eduardo

Alexandros Stamatakis

unread,
Aug 29, 2017, 1:38:06 AM8/29/17
to ra...@googlegroups.com
Dear Eduardo,

I think Guido's suggestion is a really good one, I can't propose any
good alternative.

Alexis
> www.exelixis-lab.org <http://www.exelixis-lab.org>
>
> --
> You received this message because you are subscribed to the Google
> Groups "raxml" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to raxml+un...@googlegroups.com
> <mailto:raxml+un...@googlegroups.com>.
> For more options, visit https://groups.google.com/d/optout.

Grimm

unread,
Aug 29, 2017, 3:22:49 AM8/29/17
to raxml
 

I think Guido's suggestion is a really good one, I can't propose any
good alternative.
 

So, it's decided. Eduardo, you know what to do ;) It will be new land, so you probably will get some problems with the reviewers (the anoynmous ones), but once passed, you may break a hole in the wall that they cannot close again :D

PS "... (Alexandros Stamatakis, pers. comm., 2017) ..." and pointing to the relevant RAxML google group thread can also help to convince reviewers (and more importantly editors). I myself used this option a few times.

PPS Another get-most-out-of-EPA tip: dating.
The EPA results (whether unambiguous or not) are a good basis to argue how fossils are treated when using the fossilized-birth-death approach (FBD) for dating, in case you have a nice set of fossils (FBD prefers few branches with many fossils). Since FBD treats all fossils simply as members of a lineage (a subtree), not specifying where they should be within that lineage and also not defining a minimum age like in traditional node dating, even ambiguous EPA results can be useful. You just assign them to the lineage including the two (or more) alternative placements.
You also may try Bayesian Total Evidence (BTE) dating if you have a taxon-rich data set with relatively few fossils: When the EPA is decisive (unambiguous, full support for a single position), the Bayesian inference has a higher chance not to get lost in uncertainty. Although there is no guarantee. Since EPA places one fossil at a time, it is not prone to between-fossils signal friction, but BTE may. For our Osmundaceae data (EPA placements are in this paper, which was accepted and published after the dating paper that was based on it...), BTE got caught by the complex signals of the fossil assemblage.

ear...@gmail.com

unread,
Aug 29, 2017, 8:45:40 AM8/29/17
to raxml
Great! Thanks, Alexis.

ear...@gmail.com

unread,
Aug 29, 2017, 8:47:30 AM8/29/17
to raxml
"... (Alexandros Stamatakis, pers. comm., 2017) ..."

No, that should be "(Guido Grimm and Alexandros Stamatakis, pers. com., 2017)" :-)

Thanks for all your help!
Reply all
Reply to author
Forward
0 new messages