help me explain this symmetric_difference behavior

30 views
Skip to first unread message

Rich Drewes

unread,
Jan 13, 2016, 8:38:11 PM1/13/16
to DendroPy Users
This simple test program:

----
import dendropy
from dendropy.calculate import treecompare

t=dendropy.Tree.get_from_path('testtree.tre', 'newick')
print "here is the tree we loaded:"
print t

# do a simple comparison of tree to itself
sd=treecompare.symmetric_difference(t, t)

print "why is the tree different now?"
print t
----

Produces the following output:

----
here is the tree we loaded:
((T1:0.436709028633,T2:0.436709028633):3.24085844138,(((T3:0.317726916724,T4:0.317726916724):0.715966590258,(T5:0.028220518964,T6:0.028220518964):1.00547298802):2.40608658995,((T7:1.80803701947,(((T8:0.107837785359,T9:0.107837785359):0.522902197758,(T10:0.345780037892,T11:0.345780037892):0.284959945226):1.08433960007,(T12:0.872051946511,(T13:0.00781083876464,T14:0.00781083876464):0.864241107747):0.843027636673):0.0929574362867):1.2816938212,((T15:0.282799679467,T16:0.282799679467):1.12870039375,(T17:0.292885603324,(T18:0.256266318982,(T19:0.228237897404,T20:0.228237897404):0.0280284215778):0.0366192843416):1.11861446989):1.67823076746):0.350049256255):0.237787373088):3.57143334276
why is the tree different now?
((T1:0.436709028633,T2:0.436709028633):3.47864581447,((T3:0.317726916724,T4:0.317726916724):0.715966590258,(T5:0.028220518964,T6:0.028220518964):1.00547298802):2.40608658995,((T7:1.80803701947,(((T8:0.107837785359,T9:0.107837785359):0.522902197758,(T10:0.345780037892,T11:0.345780037892):0.284959945226):1.08433960007,(T12:0.872051946511,(T13:0.00781083876464,T14:0.00781083876464):0.864241107747):0.843027636673):0.0929574362867):1.2816938212,((T15:0.282799679467,T16:0.282799679467):1.12870039375,(T17:0.292885603324,(T18:0.256266318982,(T19:0.228237897404,T20:0.228237897404):0.0280284215778):0.0366192843416):1.11861446989):1.67823076746):0.350049256255):3.57143334276
----

Can anyone explain why the tree changes? (Look at the number of parens before taxon T3, and the extra branchlen 0.237787373088 in the first printing of the tree). At first I thought the change was inconsequential, just a rerooting or something that didn't change the topology, but after some consideration I don't think that's the case. In any event, why is symmetric_difference() changing anything at all?

Thanks,
Rich

Jeet Sukumaran

unread,
Jan 13, 2016, 9:30:41 PM1/13/16
to dendrop...@googlegroups.com
1. When calculating the symmetric difference, DendroPy has to calculate
the splits/bipartition hashes.

2. When doing this on an unrooted tree, the basal (root) split has to
be collapsed for correct behavior.

3. This is counter to most folks' (initial) intuition, and many a
investigator has been thrown off by this, to the point of scornful
skepticism, bitter frustration, adamant denial, condescending criticism,
or righteous fury.

4. Emotional responses nonwithstanding, these folks are wrong. Nothing
changes the fact that the collapse of the basal bifurcation on unrooted
trees is absolutely necessary to produce correct (and, more importantly,
arguably, predictable) results when comparing splits across trees. And
produce correct results it does.

5. The reason for this is briefly mentioned here (extracted from a series
of email responses):


***

Note that with unrooted trees, the basal bifurcation will be
collapsed. This is (briefly) mentioned here:
http://dendropy.org/primer/bipartitions.html . No real explanation is
given in the documentation, but the reason is that, on unrooted trees,
retaining the basal bifurcation results in an artifactually redundant
redundant splits hash on the tree. The practical effect of this is
rooted trees and unrooted trees cannot be predictably and meaningfully
compared.

Incidentally, the intuition behind this working is that, no matter where
the root is placed on a tree, the split immediately below that is
excluded from analysis. But instead of seeing this as different splits
being excluded from different trees depending on the rooting, we should
see it as an entire set of splits being excluded across all analysis,
and each unrooted tree has one and exactly one split belonging to that
set.

In other words, it is impossible for a split that has been excluded from
analysis due to the basal bifurcation being collapsed on any given
unrooted tree to be found on any other unrooted tree that has a
different pseudo-root placement, so in terms of support for or against
any particular tree, this excluded split cannot count in any way. And,
of course, if another tree has the same root placement, then it too will
have the same split excluded.

***

6. And here is a script that explores this:

~~~
#! /usr/bin/env python

import dendropy
import math

trees_str = """\
[&U] ((A,B),(C,(D,E)));
[&U] ((A,B),(C,(D,E)));
[&U] ((A,B),(C,(D,E)));
[&U] ((A,B),(C,(D,E)));
[&U] ((A,B),(C,(D,E)));
[&U] ((A,B),(C,(D,E)));
[&U] ((A,B),(C,(D,E)));
[&U] ((A,B),(C,(D,E)));
[&U] ((A,B),(C,(D,E)));
[&U] ((A,B),(C,(D,E)));

[&U] (((A,B),C),(D,E));
[&U] (((A,B),C),(D,E));
[&U] (((A,B),C),(D,E));
[&U] (((A,B),C),(D,E));
[&U] (((A,B),C),(D,E));
[&U] (((A,B),C),(D,E));
[&U] (((A,B),C),(D,E));

[&U] ((((A,B),C),D),E);
[&U] ((((A,B),C),D),E);
[&U] ((((A,B),C),D),E);
[&U] ((((A,B),C),D),E);
[&U] ((((A,B),C),D),E);

[&U] ((((A,B),C),E),D);
[&U] ((((A,B),C),E),D);
[&U] ((((A,B),C),E),D);
[&U] ((((A,B),C),E),D);

[&U] (A,(B,(C,(D,E))));
[&U] (A,(B,(C,(D,E))));
[&U] (A,(B,(C,(D,E))));
[&U] (A,(B,(C,(D,E))));

[&U] (B,(A,(C,(D,E))));
[&U] (B,(A,(C,(D,E))));
[&U] (B,(A,(C,(D,E))));
[&U] (B,(A,(C,(D,E))));
[&U] (B,(A,(C,(D,E))));
"""

def run(rooting):
trees = dendropy.TreeList.get(
data=trees_str,
schema="newick",
rooting=rooting)
tree_array = trees.as_tree_array()
log_split_supports, best_tree =
tree_array.calculate_log_product_of_split_supports()
for log_split_support in log_split_supports:
print("{}".format(math.exp(log_split_support)))

print("-- Unrooted -- ")
run("force-unrooted")

print("-- Rooted -- ")
run("force-rooted")
~~~

with results:

~~~
-- Unrooted --
1.0
1.0
...
...
...
-- Rooted --
0.29956851312
0.29956851312
...
0.252268221574
0.252268221574
...
0.339591836735
0.339591836735
...
0.403265306122
0.403265306122
...
~~~

7. The moral of this story is that you have to always be aware, and if
necessary, manage the rooting state of your trees. If you want rooted
trees from a Nexus/Newick source, use "rooting='force-rooted'" when
reading the trees or prefix your tree statements with "[&R]".
Or otherwise explicitly set the rooting state (``tree.is_rooted=True``
or ``tree.is_rooted=False``) BEFORE carrying out ANY calculations or
other operations that require the split hashes on the trees.

-- jeet

--------------------------------------
Jeet Sukumaran
--------------------------------------
jeetsu...@gmail.com
--------------------------------------
Blog/Personal Pages:
http://jeetworks.org/
GitHub Repositories:
http://github.com/jeetsukumaran
Photographs (as stream):
http://www.flickr.com/photos/jeetsukumaran/
Photographs (by galleries):
http://www.flickr.com/photos/jeetsukumaran/sets/
--------------------------------------

Jeet Sukumaran

unread,
Jan 13, 2016, 9:33:57 PM1/13/16
to dendrop...@googlegroups.com
Discussion of this can also be found here:

http://dendropy.org/primer/treemanips.html#rooting-derooting-and-rerooting
--

Rich Drewes

unread,
Jan 14, 2016, 10:58:42 AM1/14/16
to DendroPy Users
Jeet:

Thanks for your detailed response. I suspected it was something like what you described.

However, even accepting that the collapse is necessary for the comparison, I am still unclear why the treecompare function *permanently modifies* the tree passed in to it. The comparison could be done using copies of the trees, then adjusted as necessary for the comparison, without changing the tree that is passed in. Wouldn't that be better? I looked at the docs to see if this side effect was noted and I did not find such (but I might have looked in the wrong place).

If the topology of the tree were not changed by the treecompare function, I could maybe accept a rerooting. But isn't it dangerous to change the *actual topology* of the tree (the root level collapse) without letting anyone know this is happening?

If efficiency issues dictate that a copy of the tree cannot be made for the comparison, maybe this is just a documentation issue, where the docs make clear that the tree passed in to treecompare functions will be modified.

Thanks,
Rich

Jeet Sukumaran

unread,
Jan 14, 2016, 11:21:42 AM1/14/16
to dendrop...@googlegroups.com
Hi Rich,

First of all, it *is* documented that these changes happen. In multiple
places, as I have pointed out in the links. Furthermore, an argument
could be made that understanding this is part of understanding
(phylogenetic) trees rather than some technical aspect of the DendroPy
library.

Secondly, in an unrooted tree, the basal split (which corresponds to the
placement of the root) is not a real part of the tree. It is just an
artifact of using a bifurcating tree data structure to represent
something that exists in the actual model. An unrooted tree has, by
definition, no root. But that basal split is created by the placement of
a root. Remove the root and that basal split goes away. The reason for
the root is an algorithmic artifact, not part of the conceptual data
model. Hence, the basal split, as well, is an algorithmic artifact and
not part of the data model. So there is not change in topology as such
by collapsing the basal split. If the collapse of the basal bifurcation
changes anything material in what you are trying to do, then you are
really (conceptually) working with a rooted tree even if you think you
are or want to be working with with an unrooted tree.

Thirdly, no, [making a copy of tree] would not be unconditionally
better. Sure, it would be better in some contexts. But not in others
(processing 1000's or 10's of 1000's of trees). And, more to the point,
all this extra overhead to preserve an *artifactual* structure makes no
sense.

I reiterate: on an unrooted tree, that basal root is not real. It is
fantasy. Many users get very fond of it and are distressed when it
disappears. These users fall into two groups: those who really should be
working with rooted trees but do not know it, and those who do not
understand unrooted trees.
> --
> You received this message because you are subscribed to the Google
> Groups "DendroPy Users" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to dendropy-user...@googlegroups.com
> <mailto:dendropy-user...@googlegroups.com>.
> For more options, visit https://groups.google.com/d/optout.
Reply all
Reply to author
Forward
0 new messages