Tree Comparison Problem

Jeremy Brown

unread,

May 1, 2012, 11:33:57 AM5/1/12

to dendrop...@googlegroups.com

Dear Dendrophiles,

I am trying to filter trees from several posterior distributions according to whether or not they match one of a set of reference trees. The trees in any given posterior distribution may have a reduced number of taxa compared to the reference trees, so prior to comparison I prune down each reference to match the taxa in any given posterior sample. I've then been using the false_positives_and_negatives() tree member function to perform comparisons. However, I noticed that some of the values returned after this procedure seemed off. The false_positives_and_negatives() function was giving a result of (2,2) when comparing two trees that looked the same to me. Just to be complete, I printed these trees as newick strings, read them back in, and then did the comparison again. This time, I got the expected result of (0,0). I've probably made some simple error, but I can't figure out what it is.

Any help would be greatly appreciated. I'm using version 3.11.0. Below, I've pasted an example script, an example .t file, and the output that this produces.

Thanks,
Jeremy

Script:

#!/usr/bin/env python

import dendropy

# Define reference topology
refTree = dendropy.Tree.get_from_string("((galGal3,taeGut1),allMis0,(((hg19,ornAna1),chrPic0),(pytMol0,anoCar2)))",schema="newick")

# Read in posterior distribution (just one tree for this example)
postTrees = dendropy.TreeList(taxon_set=refTree.taxon_set)
postTrees.read_from_path("example.t","nexus",taxon_set=refTree.taxon_set,as_unrooted=True)

# Prune reference trees down to only taxa in posterior trees
for i in refTree.leaf_nodes():
if not (postTrees[0].find_node_with_taxon_label(i.get_node_str())):
refTree.prune_taxa_with_labels([i.get_node_str()],update_splits=True)

# Spit out newick trees for reference
print(postTrees[0].as_newick_string())
print(refTree.as_newick_string())

# Display trees to make sure reference tree has been pruned properly
postTrees[0].print_plot()
refTree.print_plot()

# Calculate false positives and negatives in posterior tree relative to reference
print(postTrees[0].false_positives_and_negatives(refTree))

# Convert both trees to strings, then read them back in
# Recalculate false positives and negatives in posterior tree relative to reference
testTree = dendropy.Tree.get_from_string(postTrees[0].as_newick_string(),schema="newick")
print(testTree.false_positives_and_negatives(dendropy.Tree.get_from_string(refTree.as_newick_string(),schema="newick",taxon_set=testTree.taxon_set)))

Tree file ("example.t"):

#NEXUS
[ID: 8249523985]
begin trees;
  translate
      1 hg19,
      2 ornAna1,
      3 galGal3,
      4 taeGut1,
      5 anoCar2,
      6 chrPic0;
  tree rep.3100 = (2:0.076762,(((3:0.058522,4:0.026272):0.095072,5:0.143657):0.008043,6:0.041739):0.155849,1:0.145005);
end;

Output:

(ornAna1:0.076762,(((galGal3:0.058522,taeGut1:0.026272):0.095072,anoCar2:0.143657):0.008043,chrPic0:0.041739):0.155849,hg19:0.145005)
((galGal3,taeGut1),((hg19,ornAna1),chrPic0),anoCar2)
/------------------------------------------------------------------------------- ornAna1
|
|                                                           /------------------- galGal3
|                                       /-------------------+
|                   /-------------------+                   \------------------- taeGut1
+                   |                   |
|-------------------+                   \--------------------------------------- anoCar2
|                   |
|                   \----------------------------------------------------------- chrPic0
|
\------------------------------------------------------------------------------- hg19

                                                    /-------------------------- galGal3
/----------------------------------------------------+
|                                                    \-------------------------- taeGut1
|
|                                                    /-------------------------- hg19
+                         /--------------------------+
|-------------------------+                          \-------------------------- ornAna1
|                         |
|                         \----------------------------------------------------- chrPic0
|
\------------------------------------------------------------------------------- anoCar2

(2, 2)
(0, 0)

Jeremy M. Brown

Assistant Professor

Louisiana State University

Dept. of Biological Sciences

202 Life Sciences Building

Baton Rouge, LA 70803

(225) 578-1745

http://www.phyleauxgenetics.org/

Jeet Sukumaran

unread,

May 2, 2012, 1:56:55 PM5/2/12

to DendroPy Users

Hi Jeremy,

Looks like you have identified a bug in DendroPy's splits hash
calculations.

It affects unrooted trees with incomplete leaf sets. We've fixed this
in the
latest commits on GitHub, and will be pushing a bug fix release soon.

In the mean time, a work-around would be to:

(1) Not calculate any splits hashes till we know the final taxon
set.
In any case, you want to do this. Setting
'update_splits=True',
recalculates the split hashes for the *entire* tree every
time.
It would be far more efficient to recalculate the split hashes
after
all the surplus tips have been pruned.

(2) Drop the unneeded taxa from the shared TaxonSet reference.

(3) Also, it probably would be slightly more efficient to search
for the
taxa using the TaxonObject directly instead of performing
multiple
string dereferencing and matchings. So, instead of:

if not
(postTrees[0].find_node_with_taxon_label(i.get_node_str())):
refTree.prune_taxa_with_labels([i.get_node_str()],
update_splits=True)

do:

if not (postTrees[0].find_node_for_taxon(i.taxon)):
refTree.prune_taxa([i.taxon], update_splits=True)

Note that (1) and (3) above are generally the more efficient way to
implement
what you want. You probably still want to do these even if you pull
the latest
(fixed) version of the code. On the other hand (2) is the work-around
for
versions 3.11.0 or less.

The code incorporating all the modifications, including the work-
around:

#!/usr/bin/env python

import dendropy

# Define reference topology
refTree =
dendropy.Tree.get_from_string("((galGal3,taeGut1),allMis0,
(((hg19,ornAna1),chrPic0),(pytMol0,anoCar2)))",schema="newick")

# Read in posterior distribution (just one tree for this example)
postTrees = dendropy.TreeList(taxon_set=refTree.taxon_set)

postTrees.read_from_path("example.t","nexus",taxon_set=refTree.taxon_set,as_unrooted=True)

# Prune reference trees down to only taxa in posterior trees

taxa_to_remove = [] # track the taxa that have been dropped
for i in refTree.leaf_nodes():
if not (postTrees[0].find_node_for_taxon(i.taxon)):
taxa_to_remove.append(i.taxon) # track the taxa that
have been dropped
refTree.prune_taxa([i.taxon],
update_splits=False) # 'False': do *not*
calculate splits hash yet

# work-around for incomplete leaf-set tree
# split hash bug: purge TaxonSet of all
# surplus taxa
for taxon in taxa_to_remove:
refTree.taxon_set.remove(taxon)

# Spit out newick trees for reference
print(postTrees[0].as_newick_string())
print(refTree.as_newick_string())

# Display trees to make sure reference tree has been pruned
properly
postTrees[0].print_plot()
refTree.print_plot()

# Calculate false positives and negatives in posterior tree
relative to reference
print(postTrees[0].false_positives_and_negatives(refTree))

# Convert both trees to strings, then read them back in
# Recalculate false positives and negatives in posterior tree
relative to reference
testTree =
dendropy.Tree.get_from_string(postTrees[0].as_newick_string(),schema="newick")

print(testTree.false_positives_and_negatives(dendropy.Tree.get_from_string(refTree.as_newick_string(),schema="newick",taxon_set=testTree.taxon_set)))

Thanks for bringing this issue to our attention!
Let us know if this works out OK, or you have any other issues.

-- jeet

Jeet Sukumaran

unread,

May 2, 2012, 2:00:46 PM5/2/12

to DendroPy Users

[Google gets lots of things right, but it definitely gets simple
posting of code-heavy text soooooooooooooooo wrong in ALL its
platforms. Booooo Google.

I'm reposting the previous without wrapping hoping that it appears a
little more coherent.]

# work-around for incomplete leaf-set tree
# split hash bug: purge TaxonSet of all
# surplus taxa
for taxon in taxa_to_remove:
refTree.taxon_set.remove(taxon)

# Spit out newick trees for reference
print(postTrees[0].as_newick_string())
print(refTree.as_newick_string())

# Display trees to make sure reference tree has been pruned
properly
postTrees[0].print_plot()
refTree.print_plot()

# Calculate false positives and negatives in posterior tree
relative to reference
print(postTrees[0].false_positives_and_negatives(refTree))

# Convert both trees to strings, then read them back in
# Recalculate false positives and negatives in posterior tree
relative to reference
testTree =
dendropy.Tree.get_from_string(postTrees[0].as_newick_string(),schema="newick")

print(testTree.false_positives_and_negatives(dendropy.Tree.get_from_string(refTree.as_newick_string(),schema="newick",taxon_set=testTree.taxon_set)))

Thanks for bringing this issue to our attention!
Let us know if this works out OK, or you have any other issues.

-- jeet

Jeremy Brown

unread,

May 2, 2012, 2:32:46 PM5/2/12

to dendrop...@googlegroups.com

Jeet,

Many thanks for the solution and the efficiency tips. Your work-around works for me on the simple example. I'll try it on the full posterior distributions now.

One last question: it seems that your revised code never recalculates the split hashes at all. Is that correct? I thought split hash recalculation was necessary for proper tree comparison after pruning taxa.

Thanks again,

Jeremy

Jeet Sukumaran

unread,

May 2, 2012, 3:22:43 PM5/2/12

to DendroPy Users

Hi Jeremy,

The truth is, provided that the splits hashes have *never* been
calculated on a tree, they will be calculated on a first-need basis.
So as long as you can ensure that this is the case, not explicitly
calculating the splits is fine. The problem comes when you calculate
the splits hashes (either explicitly or implicitly), and then change/
modify the tree structure is some way. In this case, there is no
automatic recalculation of the splits hash, and you would need to
calculate them. In tutorials I do not describe this distinction,
because it seemed to me safer for everyone to explicitly calculate the
splits hash. That is probably the approach that should be taken here
as well --- explicit is better than implicit. So, make sure you call
the 'update_splits()' method of the tree when all the pruning etc. is
done.

Updated code:

#!/usr/bin/env python

import dendropy

# Define reference topology
refTree =
dendropy.Tree.get_from_string("((galGal3,taeGut1),allMis0,
(((hg19,ornAna1),chrPic0),(pytMol0,anoCar2)))",schema="newick")

# Read in posterior distribution (just one tree for this example)
postTrees = dendropy.TreeList(taxon_set=refTree.taxon_set)

postTrees.read_from_path("example.t","nexus",taxon_set=refTree.taxon_set,as_unrooted=True)

# Prune reference trees down to only taxa in posterior trees
taxa_to_remove = [] # track the taxa that have been dropped
for i in refTree.leaf_nodes():
if not (postTrees[0].find_node_for_taxon(i.taxon)):
taxa_to_remove.append(i.taxon) # track the taxa that
have been dropped
refTree.prune_taxa([i.taxon],update_splits=False)

# work-around for incomplete leaf-set tree
# split hash bug: purge TaxonSet of all
# surplus taxa
for taxon in taxa_to_remove:
refTree.taxon_set.remove(taxon)

# calculate splits hash
refTree.update_splits()
postTrees[0].update_splits()

# Spit out newick trees for reference
print(postTrees[0].as_newick_string())
print(refTree.as_newick_string())

# Display trees to make sure reference tree has been pruned
properly
postTrees[0].print_plot()
refTree.print_plot()

# Calculate false positives and negatives in posterior tree
relative to reference
print(postTrees[0].false_positives_and_negatives(refTree))

# Convert both trees to strings, then read them back in
# Recalculate false positives and negatives in posterior tree
relative to reference
testTree =
dendropy.Tree.get_from_string(postTrees[0].as_newick_string(),schema="newick")

print(testTree.false_positives_and_negatives(dendropy.Tree.get_from_string(refTree.as_newick_string(),schema="newick",taxon_set=testTree.taxon_set)))

You might want to update the posterior splits in the loop when you
inspect their taxa, as they do not get modified structurally.

-- jeet

> ...
>
> read more »

Reply all

Reply to author

Forward