Bug? in symmetric_difference or newick cross conversion


Rich Drewes

Sep 5, 2014, 3:07:13 PM
to dendrop...@googlegroups.com
Hello,

I was exploring the symmetric_difference metric provided by Dendropy and not getting the results I expected on trees that I was creating from newick tree strings.  The following example program illustrates what I am seeing.  Two trees are created randomly, and the symmetric difference is computed.  Then the trees are converted to newick strings, and then converted back from those strings to Trees, and compared again with symmetric_difference.  The result sometimes differs from the symmetric_difference computed on the original trees.

Possibly there is some information lost in the conversion to and from newick strings, but I can't see why that would be the case.  Can anyone clarify this?

Also possibly I'm doing something very wrong as I'm quite new to Dendropy.

DendroPy version 3.12.0.

Thanks,
Rich

----
#! /usr/bin/env python

import dendropy

trees = []
for i in range(100):
    d = []
    for j in range(2):
        t=dendropy.treesim.birth_death(birth_rate=1.0, death_rate=0.5, ntax=10)
        d.append(t)
    trees.append(d)

for t1, t2 in trees:
    d1 = t1.symmetric_difference(t2)
    nt1=dendropy.Tree.get_from_string(t1.as_newick_string(), schema='newick')
    nt2=dendropy.Tree.get_from_string(t2.as_newick_string(), schema='newick')
    nd1 = nt1.symmetric_difference(nt2)
    if nd1!=d1:
        print "WEIRD:  the symmetric difference of the original trees is", d1, "but after converting them to newick representation then back, it is", nd1
    else:
        print "OK"

Jeet Sukumaran

Sep 5, 2014, 11:24:19 PM
to dendrop...@googlegroups.com
Hi Rich,

This does not address the immediate problem, but let us first get this
out of the way: you have to ensure that all phylogenetic data objects in
the same "universe" share the same operational taxonomic unit references:

https://pythonhosted.org/DendroPy/tutorial/taxa.html

OTU's in DendroPy are distinct from their string labels, and just
because two taxonomic entities have the same label, it does not mean
that they are the same operational taxonomic concept. In particular,
when creating or reading trees (or any other phylogenetic data object),
if you want to do any operations with them, you have to make sure that
they reference the same set of OTU's as represented by TaxonSet objects
[NOTE: in DendroPy 4, this will be called "TaxonNamespace" to emphasize
this concept]. By default, unless reading via a managed collection
(e.g., a TreeList or a DataSet), every tree read in will get its own
taxon namespace reference. You have to explicitly specify the TaxonSet
to use to ensure all trees that are compared or operated on share the
same TaxonSet.
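
The distinction between a label and an OTU can be sketched in plain
Python. This is a toy model only: the Taxon and Namespace classes and the
leaf_taxa() helper below are hypothetical stand-ins written for this
illustration, not DendroPy's classes, and they assume nothing about how
DendroPy stores taxa internally. The sketch shows that two "trees" with
identical labels share no taxa when each resolves its labels in its own
namespace, and share all of them when one namespace is passed to both.

```python
# Toy stand-ins for the idea only; these are NOT DendroPy's classes.
class Taxon(object):
    """An operational taxonomic unit; identity matters, not the label."""
    def __init__(self, label):
        self.label = label


class Namespace(object):
    """Hands out exactly one Taxon object per label."""
    def __init__(self):
        self._by_label = {}

    def require(self, label):
        # Reuse the existing Taxon for this label, or create it once.
        if label not in self._by_label:
            self._by_label[label] = Taxon(label)
        return self._by_label[label]


def leaf_taxa(labels, namespace):
    """Pretend to 'read a tree': resolve each leaf label in a namespace."""
    return set(namespace.require(label) for label in labels)


labels = ["A", "B", "C"]

# Each tree gets its own namespace: same labels, different objects.
separate_1 = leaf_taxa(labels, Namespace())
separate_2 = leaf_taxa(labels, Namespace())
shared_across_separate = len(separate_1 & separate_2)

# Both trees resolve labels in one shared namespace: same objects.
ns = Namespace()
shared_1 = leaf_taxa(labels, ns)
shared_2 = leaf_taxa(labels, ns)
shared_across_shared = len(shared_1 & shared_2)

print(shared_across_separate)  # 0
print(shared_across_shared)    # 3
```

Because the toy Taxon class defines no equality of its own, Python compares
the objects by identity, which is the behaviour the paragraph above
describes for OTUs.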

In DendroPy 3, the `symmetric_difference()` operation, as well as many
other binary tree operations, generally coerces both trees to have the
same TaxonSet reference by "remapping" the OTU's from one tree to
another. That is why you still got meaningful results instead of
an error. In DendroPy 4, things will be a lot more strict: if you try to
compare two trees with different TaxonNamespace references, you will get
an error.

Ok, with that out of the way, let us address your problem. It comes from
the fact that the initial trees are compared as rooted. But when writing
out using the "as_newick_string()" function, the rooting statement is
not written, and thus when reading it back in again the tree is read as
unrooted. So, if you go this route, you will have to explicitly specify
that the tree is treated as rooted by:

nt2=dendropy.Tree.get_from_string(
        t2.as_newick_string(),
        as_rooted=True,
        schema='newick')
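
As an aside on why the rooting changes the number at all: a rough sketch
in plain Python, assuming trees written as nested tuples of leaf labels.
This is not DendroPy's implementation, and every function name here is
made up for the illustration. Treated as rooted, each internal node
contributes a clade; treated as unrooted, each internal edge contributes
a bipartition of the leaves, so two different rootings of the same
unrooted topology compare as identical.

```python
def leaf_set(tree):
    """Return the frozenset of leaf labels under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset()
    for child in tree:
        leaves |= leaf_set(child)
    return leaves


def clades(tree):
    """Collect the leaf set of every internal node (rooted view)."""
    found = set()
    if not isinstance(tree, str):
        found.add(leaf_set(tree))
        for child in tree:
            found |= clades(child)
    return found


def rooted_splits(tree):
    """Non-trivial clades: more than one leaf, fewer than all leaves."""
    everything = leaf_set(tree)
    return set(c for c in clades(tree) if 1 < len(c) < len(everything))


def unrooted_splits(tree):
    """Non-trivial bipartitions, each stored as an unordered pair of sides."""
    everything = leaf_set(tree)
    splits = set()
    for c in clades(tree):
        other = everything - c
        if len(c) > 1 and len(other) > 1:
            splits.add(frozenset([c, other]))
    return splits


def symmetric_difference(splits_1, splits_2):
    """Count the splits found in exactly one of the two trees."""
    return len(splits_1 ^ splits_2)


# ((A,B),(C,D)) and (A,(B,(C,D))): same unrooted topology, different roots.
t1 = (("A", "B"), ("C", "D"))
t2 = ("A", ("B", ("C", "D")))

rooted_distance = symmetric_difference(rooted_splits(t1), rooted_splits(t2))
unrooted_distance = symmetric_difference(unrooted_splits(t1), unrooted_splits(t2))

print(rooted_distance)    # 2
print(unrooted_distance)  # 0
```

The same pair of trees scores 2 when compared as rooted and 0 when
compared as unrooted, which is the kind of discrepancy a round trip that
drops the rooting can produce.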

But you should not take the "as_newick_string()" route at all. There is a
good reason it writes out an incomplete string: it is not a public
function for public usage, but an internal one used for debugging. In
fact, in DendroPy 4, this method no longer exists, but is replaced by
"tree._as_newick_string()" to emphasize that this is not a public method.

You should be using the "as_string()" function, which *is* a public
function, and the recommended way to get a string representation of a
phylogenetic data object. You can see the difference in the two
representations here:

t2 = dendropy.Tree.get_from_string("[&R](A,(B,C));", "newick")
print(t2.as_newick_string())
print(t2.as_string("newick"))
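
The rooting statement in question is the leading "[&R]" (rooted) or
"[&U]" (unrooted) comment token. A minimal sketch, in plain Python rather
than DendroPy, of what a reader can and cannot recover from a string; the
declared_rooting() helper is hypothetical and written only for this
illustration:

```python
import re

# Matches a leading [&R] or [&U] comment token (case-insensitive).
ROOTING_TOKEN = re.compile(r"^\s*\[&([RU])\]", re.IGNORECASE)


def declared_rooting(newick):
    """Return 'rooted', 'unrooted', or None when no token is present."""
    match = ROOTING_TOKEN.match(newick)
    if match is None:
        return None
    return "rooted" if match.group(1).upper() == "R" else "unrooted"


with_token = declared_rooting("[&R] (A,(B,C));")
without_token = declared_rooting("(A,(B,C));")

print(with_token)     # rooted
print(without_token)  # None
```

When the token is missing, the string itself says nothing about rooting,
so the reader has to fall back on a default or on an explicit argument
such as the as_rooted=True shown earlier.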

The "as_string()" function is *very* rich, and supports lots of features
that control how the tree is rendered. The "as_newick_string()" function
does not:

t2 = dendropy.Tree.get_from_string("[&R] (A,(B,C));", "newick")
for nd in t2:
    nd.annotations.add_new("!color", "#ff6600")
t2.annotations.add_new("foo", "bar")
print(t2.as_newick_string())
print(t2.as_string("newick", suppress_annotations=False))

So, putting it all together, we get:

(1) Trees have to reference the same taxonomic namespace, as given by
the TaxonSet object reference.
(2) Use the proper "as_string('newick')" function to serialize the trees.

##########################################################

#! /usr/bin/env python

import dendropy

taxon_namespace = dendropy.TaxonSet()
trees = []
for i in range(100):
    d = []
    for j in range(2):
        t = dendropy.treesim.birth_death(
                birth_rate=1.0,
                death_rate=0.5,
                ntax=10,
                taxon_set=taxon_namespace)
        d.append(t)
    trees.append(d)

for t1, t2 in trees:
    d1 = t1.symmetric_difference(t2)
    nt1 = dendropy.Tree.get_from_string(
            t1.as_string("newick"),
            schema='newick',
            taxon_set=taxon_namespace)
    nt2 = dendropy.Tree.get_from_string(
            t2.as_string("newick"),
            schema='newick',
            taxon_set=taxon_namespace)
    nd1 = nt1.symmetric_difference(nt2)

    if nd1 != d1:
        print "WEIRD: the symmetric difference of the original trees is", d1, "but after converting them to newick representation then back, it is", nd1
    else:
        print "OK"

#############################

Rich Drewes

Sep 7, 2014, 3:07:26 PM
to dendrop...@googlegroups.com
Hello,

Thanks for your reply.

I was aware of the "taxonomic universe" issue (which turns out not to be relevant to this problem, as you explain).  My example program initially used a shared taxon set, but as a test I removed it and was surprised to see that doing so did not affect the results.  I figured some magic was happening behind the scenes (the mapping you describe, which will apparently not take place in the next version of Dendropy) and I decided to leave the shared taxon set out of my example program to make it simpler.

The explanation of the tree root issue makes sense.

Thanks to you and other Dendropy developers!

Rich