Recalculating bipartitions after pruning taxa

40 views
Skip to first unread message

Donovan Parks

unread,
Jan 31, 2017, 11:48:54 AM1/31/17
to DendroPy Users
Hello,

I am having trouble understanding how to correctly recalculate bipartition encodings after pruning taxa from a tree. In particular, I am surprised to find that a tree with N taxa and bipartition masks of length N that is pruned to M taxa does not have bipartitions masks of length M. To explore this I have been using the following example code:

tree = dendropy.Tree.get_from_path('ex1.tree', 
                                    schema='newick',
                                    rooting='force-unrooted', 
                                    preserve_underscores=True)
tree.encode_bipartitions()
 
print 'No. leaves: %d' % sum([1 for n in tree.leaf_node_iter()])
print 'Size of taxon namespace: %d' % len(tree.taxon_namespace)
print 'Taxa: %s' % tree.taxon_namespace
print 'Size leafset: %d' % len(tree.seed_node.bipartition.leafset_as_bitstring())
print 'Size of split: %d' % len(tree.seed_node.bipartition.split_as_bitstring())
print 'No. bipartitions: %d' % len(tree.bipartition_encoding)

print '\nPrune tree:'
tree.retain_taxa_with_labels(['A','C','D','E'], update_bipartitions=True)
tree.purge_taxon_namespace()
tree.encode_bipartitions()
print 'No. leaves: %d' % sum([1 for n in tree.leaf_node_iter()])
print 'Size of taxon namespace: %d' % len(tree.taxon_namespace)
print 'Taxa: %s' % tree.taxon_namespace
print 'Size leafset: %d' % len(tree.seed_node.bipartition.leafset_as_bitstring())
print 'Size of split: %d' % len(tree.seed_node.bipartition.split_as_bitstring())
print 'No. bipartitions: %d' % len(tree.bipartition_encoding)

This produces the  following output:
No. leaves: 5
Size of taxon namespace: 5
Taxa: ['A', 'B', 'C', 'D', 'E']
Size leafset: 5
Size of split: 5
No. bipartitions: 8

Prune tree:
No. leaves: 4
Size of taxon namespace: 4
Taxa: ['A', 'C', 'D', 'E']
Size leafset: 5
Size of split: 5
No. bipartitions: 6

Why is the size of the leafset and split encoding in the pruned tree still 5 although the taxon namespace is size 4? Why has the number of bipartitions decreased from 8 (which seems like it should be 7) to 6? Am I missing a step to force a full recoding of the bipartitions using the modified taxon namespace?

Any and all pointers would be appreciated.

Regards,
Donovan

Jeet Sukumaran

unread,
Jan 31, 2017, 11:54:57 AM1/31/17
to dendrop...@googlegroups.com
Hi Donovan,

Could you post the tree string, so I can look into this on my end?

Thanks.

-- jeet
--



--------------------------------------
Jeet Sukumaran
--------------------------------------
jeetsu...@gmail.com
--------------------------------------
Blog/Personal Pages:
http://jeetworks.org/
GitHub Repositories:
http://github.com/jeetsukumaran
Photographs (as stream):
http://www.flickr.com/photos/jeetsukumaran/
Photographs (by galleries):
http://www.flickr.com/photos/jeetsukumaran/sets/
--------------------------------------

Donovan Parks

unread,
Jan 31, 2017, 12:04:20 PM1/31/17
to DendroPy Users
Guess that would help. :)

(A:6,(((B:7,C:3):2,D:4):5,E:11):2);

Cheers,
Donovan

Jeet Sukumaran

unread,
Jan 31, 2017, 1:03:51 PM1/31/17
to dendrop...@googlegroups.com
So, I think (and I could be wrong) that the reason for the confusion is
that you are misunderstanding the return value of
``leaf_as_bitstring()'' and ``split_as_bitstring()''. If you print these
out, you will see that they return something like : 10100. Here, a "1"
indicates the taxon at that position exists, while a "0" indicates that
it does not. Deleting the taxon does not change the length of the
bitstring because the taxon indexes are fixed.

So, if what you want to know is the size of the leafset, (which is not
necessarily a stable concept with unrooted trees, as, in principle
either side of the split could be considered its leafset depending on
rotation), you need to count the number of "1's" in leafset string:

leafset_as_bitstring().count("1")

As for the numbers of splits, I am not sure what the basis of the
expectation of 7 for the pruned tree is, but the correct calculation is:
$2N-3$, where $N$ is the number of tips. This however, does not take
into account the artifactual root edge which does not exist
conceptually. But it does exist algorithmically and programmatically,
and is thus necessarily included in the set of bipartitions encoded. so
that's why we get 8 splits for the 5 taxon tree (2*5 - 3 + 1) and 6
splits for the 4 taxon tree (2*4 - 3 + 1), with the "+1's" for the
algorithmic artifact root edge.

You might find the following diagnostic function useful to get some
insight into what's going on:

~~~
def report(tree):
print("\n\n---")
print(tree.as_string("newick"))
bipartitions = tree.encode_bipartitions()
for nd in tree:
nd.label = nd.edge.bipartition.leafset_as_bitstring()
print(tree.as_ascii_plot(show_internal_node_labels=True))
ninternal_edges = 0
nleaf_edges = 0
for nd in tree:
if nd.edge.head_node.is_leaf():
nleaf_edges += 1
else:
ninternal_edges += 1
for idx, bp in enumerate(bipartitions):
print("{:2d}/{:<2d}: {} {}".format(idx+1,
len(bipartitions),
bp.leafset_as_bitstring(),
bp.split_as_bitstring(),
))
print("Number of bipartitions: {}".format(len(bipartitions)))
print("Number of leaf edges: {}".format(nleaf_edges))
print("Number of internal edges: {}".format(ninternal_edges))
~~~

If I have misinterpreted any of your concerns, or you still have
unanswered concerns, or something I said makes no sense, please let me know.


On 1/31/17 11:48 AM, Donovan Parks wrote:
> --
> You received this message because you are subscribed to the Google
> Groups "DendroPy Users" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to dendropy-user...@googlegroups.com
> <mailto:dendropy-user...@googlegroups.com>.
> For more options, visit https://groups.google.com/d/optout.

Donovan Parks

unread,
Jan 31, 2017, 3:32:26 PM1/31/17
to DendroPy Users
Hello Jeet,

Perhaps a more expanded example is needed here. What I'd like to do is the following:
1) Load two trees defined over similar, but not identical taxa
2) Prune the trees to a common set of taxa
3) Calculate the bipartition encodings for these pruned trees over the now common set of taxa
4) Use the bipartition encodings for things like calculating the Robinson-Foulds metric or looking for compatible splits

I can get this to work, but my solution seems like a hack to me which is why I am trying to understand how DendroPy handles Taxon Namespaces and Bipartitions after pruning. An example:

tns = dendropy.TaxonNamespace()
tree1 = dendropy.Tree.get_from_string('(A:6,(((B:7,C:3):2,D:4):5,E:11):2,F:2);', 
                                    schema='newick',
                                    rooting='force-unrooted', 
                                    taxon_namespace=tns)
                                    
                                
tree2 = dendropy.Tree.get_from_string('((A:6,(((B:7,CX:3):2,D:4):5,E:11):2):1,FX:2);', 
                                        schema='newick',
                                        rooting='force-unrooted', 
                                        taxon_namespace=tns)
                                        
tree1.encode_bipartitions()
tree2.encode_bipartitions()
print 'Size of taxon namespace: %d' % len(tree1.taxon_namespace)
print 'Size of split for T1: %d' % len(tree1.seed_node.bipartition.split_as_bitstring())                                      
print 'Size of split for T2: %d' % len(tree2.seed_node.bipartition.split_as_bitstring())                                      
                                       
# prune trees to a common set of taxa
taxa1 = set()
for t in tree1.leaf_node_iter():
    taxa1.add(t.taxon.label)
    
taxa2 = set()
for t in tree2.leaf_node_iter():
    taxa2.add(t.taxon.label)
    
taxa_in_common = taxa1.intersection(taxa2)
print 'Taxa in common:', len(taxa_in_common)
tree1.retain_taxa_with_labels(taxa_in_common)
tree2.retain_taxa_with_labels(taxa_in_common)

# prune taxon namespace and recalculate splits
#tree1.purge_taxon_namespace()
tree1.encode_bipartitions()
tree2.encode_bipartitions()
print 'Size of split for T1 (after pruning): %d' % len(tree1.seed_node.bipartition.split_as_bitstring())
print 'Size of split for T2 (after pruning): %d' % len(tree2.seed_node.bipartition.split_as_bitstring())
  
tns2 = dendropy.TaxonNamespace()
treeX = dendropy.Tree.get_from_string(tree1.as_string(schema="newick",), 
                                        schema='newick',
                                        rooting='force-unrooted', 
                                        taxon_namespace=tns2)
treeY = dendropy.Tree.get_from_string(tree2.as_string(schema="newick",), 
                                        schema='newick',
                                        rooting='force-unrooted', 
                                        taxon_namespace=tns2)
                                        
treeX.encode_bipartitions()
treeY.encode_bipartitions()
print 'Size of split for TX: %d' % len(treeX.seed_node.bipartition.split_as_bitstring())
print 'Size of split for TY: %d' % len(treeY.seed_node.bipartition.split_as_bitstring())                                        

This produces:
Size of taxon namespace: 8
Size of split for T1: 6
Size of split for T2: 8
Taxa in common: 4
Size of split for T1 (after pruning): 5
Size of split for T2 (after pruning): 5
Size of split for TX: 4
Size of split for TY: 4

Again, what confuses me is the size of the bipartition encoding. Before pruning, the common taxon namespace has 8 taxa so it surprises me that the encoding for tree 1 (T1) has a length of 6. I understand that this is because T1 has 6 taxa and the encoding was calculated during the creation of the tree. Is there a way to recalculate the bipartition encoding for T1 so it will also be over the common taxon namespace of 8 taxa? T2 works in exactly this way since it was feed the initial taxon namespace for T1.

My real goal here is to prune these two trees down to the 4 taxa they have in common and then calculate the bipartition encoding. Pruning the trees and recalculating the encodings results in encodings of length 5? Which 5 taxa would these refer to? 

As expected, if I create new trees (TX and TY) from the Newick representations of the pruned trees they both have bipartition encoding of length 4.

Cheers,
Donovan

Jeet Sukumaran

unread,
Jan 31, 2017, 3:57:13 PM1/31/17
to dendrop...@googlegroups.com
(1) So, again, ``len(b.split_as_bitstring())'' is NOT reporting what you
think it is reporting. It is not reporting the size of the encoding, nor
the number of splits, nor the number tips, nor the number of taxa, etc.
It is reporting the maximum size that the taxon namespace ever was,
which is really only of internal book-keeping interesting. It is, in
fact, for most practical purposes, a pretty useless value.

(2) I am repeating this point again, because (a) it is the major source
of your confusion, and (b) I did state this in the previous email: the
length of the the split representation as bitstring is NOT the length of
the bipartition encoding.

(3) The number of elements in the bipartition encoding is exactly equal
to the number of splits plus 1. The number of splits for an unrooted
tree is exactly 2N-3 (where N=number of tips on the tree; for details on
this, you could look up, e.g. Felsenstein 2004). So the size of the
bipartition encoding is correct: for 5 taxa, it is 8; for 4 taxa it os 6.

(4) You can get the number of bipartitions on a tree by call ``len()``
on the return value of ``encode_bipartitions()'':

bipartitions = tree.encode_bipartitions()
print(len(bipartitions))

or looking up the dictionary attributed added to the tree when encoding
bipartitions:

print(len(tree.bipartition_edge_map))

(4) For your purposes, there is no need to purge the taxon namespace.
The only reason to purge the taxon namespace is when writing to a data
file in NEXUS format to a program that crashes or otherwise demands that
every tree should have a complete leaf set.

(5) Your stated goal can be achieved by something like the following
(pseudo-pseudo-code):

assert tree1.taxon_namespace is tree2.taxon_namespace
t1_taxa = set(nd.taxon for nd in tree1.leaf_node_iter())
t2_taxa = set(nd.taxon for nd in tree2.leaf_node_iter())
to_keep = t1_taxa.intersection(t2_taxa)
tree1.retain_taxa(to_keep)
tree2.retain_taxa(to_keep)
bp1 = tree1.encode_bipartitions()
bp2 = tree2.encode_bipartitions()

(5) "Pruning the trees and recalculating the encodings results in
encodings of length 5? Which 5 taxa would these refer to?" I am not sure
that this statement makes sense given a proper understanding of what
bipartition encodings are.

(6) "As expected, if I create new trees (TX and TY) from the Newick
representations of the pruned trees they both have bipartition encoding
of length 4." Unfortunately, this is a case of two wrongs making a
wrong. First, your expectation is wrong --- the bipartition encoding of
the new trees should be of length 6. But the reason that your incorrect
expectation is met is because you are printing out the wrong value and
reading it as the bipartition encoding length.

(7) "Before pruning, the common taxon namespace has 8 taxa so it
surprises me that the encoding for tree 1 (T1) has a length of 6." Here
is another case of two wrongs making another wrong. Your expectation
that an 8-tip tree does not have an encoding length of 6 is wrong (it
should and it does), and the value you are printing out
(len(tree1.seed_node.bipartition.split_as_bitstring())) is, anyway, as I
have noted in point (1) and (2) and in the previous email, is meaningless.
> > an email to dendropy-user...@googlegroups.com <javascript:>
> > <mailto:dendropy-user...@googlegroups.com <javascript:>>.
> > For more options, visit https://groups.google.com/d/optout
> <https://groups.google.com/d/optout>.
>
> --
>
>
>
> --------------------------------------
> Jeet Sukumaran
> --------------------------------------
> jeetsu...@gmail.com <javascript:>
> --------------------------------------
> Blog/Personal Pages:
> http://jeetworks.org/
> GitHub Repositories:
> http://github.com/jeetsukumaran <http://github.com/jeetsukumaran>
> <http://www.flickr.com/photos/jeetsukumaran/sets/>
> --------------------------------------
>
> --
> You received this message because you are subscribed to the Google
> Groups "DendroPy Users" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to dendropy-user...@googlegroups.com
> <mailto:dendropy-user...@googlegroups.com>.

Donovan Parks

unread,
Jan 31, 2017, 4:30:05 PM1/31/17
to dendrop...@googlegroups.com
Hello Jeet,

Thank you for the detail response. I am still a little lost though. What I am after if the bitmask representation of a bipartition as explained in the DendroPy documentation as:

"A bipartition is modeled using a bitmask. This is a a bit array representing the membership of taxa, with the least-significant bit corresponding to the first taxon, the next least-signficant bit corresponding to the second taxon, and so on, till the last taxon corresponding to the most-significant bit. Taxon membership in one of two arbitrary groups, ‘0’ or ‘1’, is indicated by its corresponding bit being unset or set, respectively."

Is this not what is returned by "split_as_bitstring()"? If this is the case, it would seem that the ``len(b.split_as_bitstring())'' should be the number of extant taxa (tips) in the tree. Though, I take you point that for implementation reasons this may related to the maximum size of the taxon namespace.

Anyway, I think my real issue lays elsewhere and I was just confused that the length of the bitmask seem to follow some internal book-keeping rules. 

Cheers,
Donovan
> You received this message because you are subscribed to a topic in the Google Groups "DendroPy Users" group.
> To unsubscribe from this topic, visit https://groups.google.com/d/topic/dendropy-users/7uQSi2oPxJE/unsubscribe.
> To unsubscribe from this group and all its topics, send an email to dendropy-user...@googlegroups.com.

Jeet Sukumaran

unread,
Jan 31, 2017, 4:35:48 PM1/31/17
to dendrop...@googlegroups.com
Hi Donovan,

> Is this not what is returned by "split_as_bitstring()"?

It is.

> If this is the
> case, it would seem that the ``len(b.split_as_bitstring())'' should be
> the number of extant taxa (tips) in the tree. Though, I take you point
> that for implementation reasons this may related to the maximum size of
> the taxon namespace.

Exactly this. Basically, once a taxon is accessioned into a namespace,
its "index" slot remains there in perpetuity, even if it is later
deleted. This is an explicit design decision for stability and
robustness (among other things, we want to keep the taxon indexes
stable, and this avoids the wrong taxa being mistakenly cross-indexed
when, e.g., deleting a taxon and then adding a new one).

Perhaps if you described why you want this information or how you plan
to use it, I might be able to suggest an approach that might work or
maybe even will help you achieve your ultimate objective without getting
into the split bitstring representation length?


On 1/31/17 4:29 PM, Donovan Parks wrote:
> Hello Jeet,
>
> <mailto:dendropy-user...@googlegroups.com> <javascript:>
>>> > <mailto:dendropy-user...@googlegroups.com
> <mailto:dendropy-users%2Bunsu...@googlegroups.com> <javascript:>>.
>>> > For more options, visit https://groups.google.com/d/optout
>>> <https://groups.google.com/d/optout>.
>>>
>>> --
>>>
>>>
>>>
>>> --------------------------------------
>>> Jeet Sukumaran
>>> --------------------------------------
>>> jeetsu...@gmail.com <mailto:jeetsu...@gmail.com> <javascript:>
>>> --------------------------------------
>>> Blog/Personal Pages:
>>> http://jeetworks.org/
>>> GitHub Repositories:
>>> http://github.com/jeetsukumaran <http://github.com/jeetsukumaran>
>>> Photographs (as stream):
>>> http://www.flickr.com/photos/jeetsukumaran/
>>> <http://www.flickr.com/photos/jeetsukumaran/>
>>> Photographs (by galleries):
>>> http://www.flickr.com/photos/jeetsukumaran/sets/
>>> <http://www.flickr.com/photos/jeetsukumaran/sets/>
>>> --------------------------------------
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "DendroPy Users" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to dendropy-user...@googlegroups.com
> <mailto:dendropy-users%2Bunsu...@googlegroups.com>
>>> <mailto:dendropy-user...@googlegroups.com
> <mailto:dendropy-users%2Bunsu...@googlegroups.com>>.
>>> For more options, visit https://groups.google.com/d/optout.
>>
>>
>> --
>>
>>
>>
>> --------------------------------------
>> Jeet Sukumaran
>> --------------------------------------
>> jeetsu...@gmail.com <mailto:jeetsu...@gmail.com>
>>
>> --------------------------------------
>> Blog/Personal Pages:
>> http://jeetworks.org/
>> GitHub Repositories:
>> http://github.com/jeetsukumaran
>> Photographs (as stream):
>> http://www.flickr.com/photos/jeetsukumaran/
>> Photographs (by galleries):
>> http://www.flickr.com/photos/jeetsukumaran/sets/
>> --------------------------------------
>>
>> --
>> You received this message because you are subscribed to a topic in the
> Google Groups "DendroPy Users" group.
>> To unsubscribe from this topic, visit
> https://groups.google.com/d/topic/dendropy-users/7uQSi2oPxJE/unsubscribe.
>> To unsubscribe from this group and all its topics, send an email to
> dendropy-user...@googlegroups.com
> <mailto:dendropy-users%2Bunsu...@googlegroups.com>.

Donovan Parks

unread,
Jan 31, 2017, 5:28:32 PM1/31/17
to dendrop...@googlegroups.com
Hello Jeet,

Thank you for the help and explanations. Ultimately, it seems my confusion over the bitmasks just lead me in the wrong direction. I have everything sorted now.

Thanks,
Donovan

    > For more options, visit https://groups.google.com/d/optout
    <https://groups.google.com/d/optout>.

    --



    --------------------------------------
    Jeet Sukumaran
    --------------------------------------
    jeetsu...@gmail.com <mailto:jeetsu...@gmail.com> <javascript:>

    --------------------------------------
    Blog/Personal Pages:
        http://jeetworks.org/
    GitHub Repositories:
        http://github.com/jeetsukumaran <http://github.com/jeetsukumaran>
    Photographs (as stream):
        http://www.flickr.com/photos/jeetsukumaran/
    <http://www.flickr.com/photos/jeetsukumaran/>
    Photographs (by galleries):
        http://www.flickr.com/photos/jeetsukumaran/sets/
    <http://www.flickr.com/photos/jeetsukumaran/sets/>
    --------------------------------------

--
You received this message because you are subscribed to the Google
Groups "DendroPy Users" group.
To unsubscribe from this group and stop receiving emails from it, send

For more options, visit https://groups.google.com/d/optout.


--



--------------------------------------
Jeet Sukumaran
--------------------------------------
jeetsu...@gmail.com <mailto:jeetsu...@gmail.com>

--------------------------------------
Blog/Personal Pages:
   http://jeetworks.org/
GitHub Repositories:
   http://github.com/jeetsukumaran
Photographs (as stream):
   http://www.flickr.com/photos/jeetsukumaran/
Photographs (by galleries):
   http://www.flickr.com/photos/jeetsukumaran/sets/
--------------------------------------

--
You received this message because you are subscribed to a topic in the
Google Groups "DendroPy Users" group.
To unsubscribe from this topic, visit
https://groups.google.com/d/topic/dendropy-users/7uQSi2oPxJE/unsubscribe.
To unsubscribe from this group and all its topics, send an email to

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google
Groups "DendroPy Users" group.
To unsubscribe from this group and stop receiving emails from it, send

For more options, visit https://groups.google.com/d/optout.

--



--------------------------------------
Jeet Sukumaran
--------------------------------------
jeetsu...@gmail.com
--------------------------------------
Blog/Personal Pages:
   http://jeetworks.org/
GitHub Repositories:
   http://github.com/jeetsukumaran
Photographs (as stream):
   http://www.flickr.com/photos/jeetsukumaran/
Photographs (by galleries):
   http://www.flickr.com/photos/jeetsukumaran/sets/
--------------------------------------

--
You received this message because you are subscribed to a topic in the Google Groups "DendroPy Users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/dendropy-users/7uQSi2oPxJE/unsubscribe.
To unsubscribe from this group and all its topics, send an email to dendropy-users+unsubscribe@googlegroups.com.
Reply all
Reply to author
Forward
0 new messages