Bug? in resolve_polytomies(rng=random)

33 views
Skip to first unread message

Yan Wong

unread,
Dec 4, 2014, 11:43:50 AM12/4/14
to dendrop...@googlegroups.com
I've just figured out why I'm having problems with seriously deep trees. These polytomies don't look very "randomly resolved" to me!

>>> import random
>>> taxa = TaxonSet([str(x) for x in range(50)])
>>> tree = treesim.star_tree(taxa)
>>> tree.resolve_polytomies(rng=random)
>>> tree.ladderize()
>>> print tree.as_ascii_plot(plot_metric='level')

        /---- 36                                                                                                                            
    /---+                                                                                                                                   
    |   |    /--- 46                                                                                                                        
    |   \----+                                                                                                                              
    |        |   /---- 1                                                                                                                    
    |        \---+                                                                                                                          
    |            |    /--- 19                                                                                                               
    |            \----+                                                                                                                     
    |                 |   /--- 15                                                                                                           
    |                 \---+                                                                                                                 
    |                     |   /---- 13                                                                                                      
    |                     \---+                                                                                                             
    |                         |    /--- 38                                                                                                  
    |                         \----+                                                                                                        
    |                              |   /---- 20                                                                                             
    |                              \---+                                                                                                    
    |                                  |    /--- 29                                                                                         
    |                                  \----+                                                                                               
    |                                       |   /---- 30                                                                                    
    |                                       \---+                                                                                           
    |                                           |    /--- 28                                                                                
    +                                           \----+                                                                                      
    |                                                |   /--- 37                                                                            
    |                                                \---+                                                                                  
    |                                                    |   /---- 27                                                                       
    |                                                    \---+                                                                              
    |                                                        |    /--- 5                                                                    
    |                                                        \----+                                                                         
    |                                                             |   /---- 24                                                              
    |                                                             \---+                                                                     
    |                                                                 |    /--- 2                                                           
    |                                                                 \----+                                                                
    |                                                                      |   /--- 23                                                      
    |                                                                      \---+                                                            
    |                                                                          |   /---- 43                                                 
    |                                                                          \---+                                                        
    |                                                                              |    /--- 21                                             
    |                                                                              \----+                                                   
    |                                                                                   \--- 9                                              
    |                                                                                                                                       
    |   /---- 8                                                                                                                             
    \---+                                                                                                                                   
        |    /--- 7                                                                                                                         
        \----+                                                                                                                              
             |   /---- 32                                                                                                                   
             \---+                                                                                                                          
                 |    /--- 41                                                                                                               
                 \----+                                                                                                                     
                      |   /--- 40                                                                                                           
                      \---+                                                                                                                 
                          |   /---- 14                                                                                                      
                          \---+                                                                                                             
                              |    /--- 44                                                                                                  
                              \----+                                                                                                        
                                   |   /---- 0                                                                                              
                                   \---+                                                                                                    
                                       |    /--- 49                                                                                         
                                       \----+                                                                                               
                                            |   /---- 47                                                                                    
                                            \---+                                                                                           
                                                |    /--- 4                                                                                 
                                                \----+                                                                                      
                                                     |   /--- 31                                                                            
                                                     \---+                                                                                  
                                                         |   /---- 6                                                                        
                                                         \---+                                                                              
                                                             |    /--- 10                                                                   
                                                             \----+                                                                         
                                                                  |   /---- 45                                                              
                                                                  \---+                                                                     
                                                                      |    /--- 16                                                          
                                                                      \----+                                                                
                                                                           |   /--- 11                                                      
                                                                           \---+                                                            
                                                                               |   /---- 18                                                 
                                                                               \---+                                                        
                                                                                   |    /--- 3                                              
                                                                                   \----+                                                   
                                                                                        |   /---- 33                                        
                                                                                        \---+                                               
                                                                                            |    /--- 26                                    
                                                                                            \----+                                          
                                                                                                 |   /---- 12                               
                                                                                                 \---+                                      
                                                                                                     |    /--- 17                           
                                                                                                     \----+                                 
                                                                                                          |   /--- 42                       
                                                                                                          \---+                             
                                                                                                              |   /---- 48                  
                                                                                                              \---+                         
                                                                                                                  |    /--- 34              
                                                                                                                  \----+                    
                                                                                                                       |   /---- 25         
                                                                                                                       \---+                
                                                                                                                           |    /--- 39     
                                                                                                                           \----+           
                                                                                                                                |   /---- 22
                                                                                                                                \---+       
                                                                                                                                    \---- 35

Jeet Sukumaran

unread,
Dec 4, 2014, 4:13:40 PM12/4/14
to dendrop...@googlegroups.com
I think the random resolution of polytomies of ``resolve_polytomies()``
works as advertised. Empirical proof is given by running the attached
script, with ensures that ``resolve_polytomies`` does, indeed, visit
every distinct topology when called multiple times. But you should read
further to understand what is going on, and why your visual assessment
is unreliable.

[NOTE: I have just pushed a change that does, in fact, update the
``resolve_polytomies()`` code. However, it really is more of clean-up
of the code than a major change in logic.]

You are assessing "random" by a *visual* inspection of a *single*
realization of polytomy resolution of a tree that has been *ladderized*.

(1) With ``resolve_polytomies()`` *over a very large number of calls*,
all possible strictly-bifurcating resolutions of a polytomy will be
realized, and any *single* call to ``resolve_polytomies()`` will yield
*one* of these strictly-bifurcating resolutions. You cannot in any way
assess this with one call on one tree. Now, I do not know if these
resolutions are going to be strictly equiprobable, even over an infinite
number of calls. But I am pretty sure that they will all be visited.

(2) Ladderizing a tree will completely obscure the most of the more
unbalanced resolutions to "look" the same.

To account for (1), you should run the process multiple times. Even
here, though, I would not trust a visual inspection of trees, and I
would outright reject any visual inspection of *ladderized* trees.

What needs to be done to check to see if ``resolve_polytomies()`` works
as advertised is to see if it does, indeed, result in all possible
bifurcations if called multiple times. And, indeed it does: run the
attached script. It generates a list of all topologies in PAUP, and then
calls ``resolve_polytomies`` on a star tree multiple times until all the
topologies have been visited. You can play around with the number of
taxa and whether or not the trees are rooted. I have run this with 8
taxa and unrooted, and it pretty quickly (as measured in time to get a
cup a coffee from the machine) visits the entire space of topologies.


-- jeet
> --
> You received this message because you are subscribed to the Google
> Groups "DendroPy Users" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to dendropy-user...@googlegroups.com
> <mailto:dendropy-user...@googlegroups.com>.
> For more options, visit https://groups.google.com/d/optout.

--



--------------------------------------
Jeet Sukumaran
--------------------------------------
jeetsu...@gmail.com
--------------------------------------
Blog/Personal Pages:
http://jeetworks.org/
GitHub Repositories:
http://github.com/jeetsukumaran
Photographs (as stream):
http://www.flickr.com/photos/jeetsukumaran/
Photographs (by galleries):
http://www.flickr.com/photos/jeetsukumaran/sets/
--------------------------------------

exp1.py

Jeet Sukumaran

unread,
Dec 4, 2014, 4:43:06 PM12/4/14
to dendrop...@googlegroups.com
The attached script looks into the equiprobable issue. It simply counts
the number of times each distinct topology is seen in each resolution of
the polytomy over a large number of runs. It gets pretty close to being
even. Sufficiently so that I think we can say that the polytomy
resolution not only visits every topology, it visits every topology with
equal probability ...
exp4.py

Jeet Sukumaran

unread,
Dec 4, 2014, 6:02:53 PM12/4/14
to dendrop...@googlegroups.com
I just looked at the DendroPy 3 code, and it appears that, in this code
base, as the polytomy is being resolved, the newly attached child nodes
are not included in the pool of the attachment points. This would
preclude a subset of topologies from being produced (e.g., 15 out of the
105 unrooted 6 taxon topologies). Not that this does not mean that the
process is random, it is just not equiprobable over the entire space of
topologies (it is equiprobably over some subset of that space and
excludes the remainder of the space). Thus, my previous point about not
being able to assess the randomness of the resolution using a visualized
inspection of a ladderized version of a single run of
`resolve_polytomies` still stands, but nonetheless, your highlighting
of the issue did lead to identification of a problem and consequent fix.
So, thanks for that.

-- jeet

On 12/4/14, 11:43 AM, Yan Wong wrote:

Yan Wong

unread,
Dec 5, 2014, 7:47:42 AM12/5/14
to dendrop...@googlegroups.com
On Thursday, 4 December 2014 21:43:06 UTC, Jeet Sukumaran wrote:
The attached script looks into the equiprobable issue.

Thanks. Of course you are quite right that examining a single case can be very misleading (I only intended it as an example), and that ladderization conceals distinct topologies. I am pretty surprised that all the cases I have examined have highly nested branches with the vast majority (usually all but one) of the nodes maximally unbalanced, with a single descendant on one branch, and *all* the remaining descendants on the other branch. Given your testing (thanks for that), this may be just my mistaken assumption about the distribution of all possible trees. 

However, I do get radically different histograms for tree balance statistics from the trees generated by resolve_polytomies() and trees generated by the PDA model of equiprobable topologies. I have a feeling that I'm mistakenly equating equiprobable topologies with equiprobable trees - in the latter case, the uniqueness of label permutations presumably needs to be taken into account. The python code below demonstrates the unbalanced nature of the trees. Compare it with e.g. the equivalent in R: hist(sapply(rtreeshape(1000,tip.number=50,model="pda"),FUN=colless, norm="pda")) 

from dendropy import Tree, TreeList, treesim, TaxonNamespace
from dendropy.calculate import treemeasure
from dendropy.simulate import treesim
import matplotlib.pyplot as plt
import random

taxa = TaxonNamespace([str(x) for x in range(50)])
a=[]
for i in range(1000):
    tree = treesim.star_tree(taxa)
    tree.resolve_polytomies(rng=random)
    a.append(treemeasure.colless_tree_imbalance(tree, "pda"))

plt.hist(a)
plt.show()

Jeet Sukumaran

unread,
Dec 5, 2014, 10:55:52 PM12/5/14
to dendrop...@googlegroups.com
I really cannot think of a more direct test and evidence that
``resolve_polytomies`` is actually generating equiprobable (labeled)
trees than the script I posted earlier: i.e., actually enumerating and
summarizing frequencies of the trees generated over multiple replicates,
and seeing that tree sampling frequencies are indeed uniform over *all*
possible trees.

I am also not sure how inspection of an indirect measure, such as the
distribution of tree balance statistics, is a better assessment. Even
more dubious would be if this is not compared with a theoretical
expectation of the distribution under equiprobable trees, but rather
another program's output.

In any case, you are right that the discrepency might have been
explained if the R function generated tree *shapes* (i.e., unlabeled
trees) as opposed to label trees.

But I do not think that this is the case. The documentation for the
function states:

"""
The PDA model (Proportional to Distinguishable Arrangements) is not a
model of growing tree. Instead, each tree with n tips has the same
probability to be generated under this model. There is (2n − 3)!!
possible trees with n tips.
"""

The ``(2n-3)!!`` is indeed the number of distinct rooted bifurcating
*labeled* trees for $n$ taxa, so this would suggest that they are not
generating tree shapes but trees.

However, as you can see by running my attached script, while every tree
is visited by the apTreeshape ``rtreeshape`` function, they are most
emphatically *not* visited in anywhere close to the same frequencies.
There is an order of magnitude or so of difference in the distribution
of the trees sampled.

Am I missing something here? If not, then either apTreeshape
``rtreeshape()`` does not work as you expect it to work (i.e.,
equivalent to ``Tree.resolve_polytomies()``), or there is a bug in the R
program. Perhaps you should contact the apTreeshape authors?

-- jeet
check_R_tree_dist.py

Jeet Sukumaran

unread,
Dec 5, 2014, 11:01:10 PM12/5/14
to dendrop...@googlegroups.com
Incidentally, I use "tree shape" to refer to an *unlabeled* tree
(typically without edge lengths/weights). Typically, when I use
"topology", I refer to a *labeled* tree without consideration of edge
lengths/weights.

Yan Wong

unread,
Dec 9, 2014, 6:10:06 PM12/9/14
to dendrop...@googlegroups.com
On Saturday, 6 December 2014 03:55:52 UTC, Jeet Sukumaran wrote:
I really cannot think of a more direct test and evidence that
``resolve_polytomies`` is actually generating equiprobable (labeled)
trees than the script I posted earlier:

Yes, certainly yours is the most rigorous test of the distribution. One small problem is that this doesn't deal well with n_tips>9, where the space of possible trees is large, and this is where I'm finding the most unexpected behaviour of resolve_polytomies. But the main reason I haven't run it is that I don't have PAUP! However, I see that the allTrees(n, rooted=TRUE) function in the phangorm R package generates trees, and I have been experimenting with sampling from these trees with n_tips=9 (hence the long delay in my reply, for which I apologise).

For such a large space, checking the probability of visiting all states is not easy, so despite your valid concerns about summary statistics, I've stuck to comparing methods using colless, as this can be calculated easily within DendroPy. I still get very weird results: in particular, resolve_polytomies() seems never to produce trees with certain colless values that *are* produced by allTrees(). For example, this valid 9-leaf tree "((((5,(8,(4,1))),3),((7,6),2)),9);" gives a colless value of 16

>>> tree = Tree.get_from_string("((((5,(8,(4,1))),3),((7,6),2)),9);", "newick")
>>> treemeasure.colless_tree_imbalance(tree, None)
16.0

Yet I never get any trees with colless 16 from resolve_polytomies() (try upping "reps" to as much as you like)

>>> from dendropy import Tree, TreeList, treesim, TaxonNamespace
>>> from dendropy.calculate import treemeasure
>>> from dendropy.simulate import treesim
>>> import numpy as np
>>> import random
>>> 
>>> taxa = TaxonNamespace([str(x) for x in range(9)])
>>> dendropy_resolve_polytomies=[]
>>> for i in range(20000):
...     tree = treesim.star_tree(taxa)
...     tree.resolve_polytomies(rng=random)
...     dendropy_resolve_polytomies.append(treemeasure.colless_tree_imbalance(tree, None))
... 
>>> print np.asarray(np.unique(dendropy_resolve_polytomies, return_counts=True)).T
[[   10.  4971.]
 [   14.  5022.]
 [   20.  5016.]
 [   28.  4991.]]

A colless value of 16 is actually one of the most common for 9-leaf trees in my sample of all tree topologies. The following code shows quite a different distribution of colless values from randomly sampled 9-leaf labelled trees. I expect the same would be found for any balance measure.

>>> from dendropy import Tree, TreeList, treesim, TaxonNamespace
>>> from dendropy.calculate import treemeasure
>>> import numpy as np
>>> 
>>> phangorn_allTrees=[]
>>> trees = TreeList.get_from_path("sample.nwk", "newick")
>>> for tree in trees:
...     phangorn_allTrees.append(treemeasure.colless_tree_imbalance(tree, None)) 
>>> print np.asarray(np.unique(phangorn_allTrees, return_counts=True)).T
[[   3.   30.]
 [   4.   18.]
 [   6.  230.]
 [   7.  163.]
 [   9.  118.]
 [  10.  614.]
 [  11.   90.]
 [  12.  231.]
 [  13.  275.]
 [  14.  395.]
 [  15.  183.]
 [  16.  518.]
 [  17.  190.]
 [  18.  438.]
 [  19.   69.]
 [  20.  485.]
 [  21.  239.]
 [  22.  260.]
 [  23.  225.]
 [  24.  259.]
 [  25.  121.]
 [  28.  543.]]

where sample.nwk has been produced from R, using the following slow code (annoyingly, generating even 5000 trees takes some time in R):

library(phangorn)
trees <- allTrees(9, rooted=TRUE, 1:9)
for (i in sample(length(trees), 5000)) {write.tree(trees[[i]], file="sample.nwk", append = TRUE)} 
 
Even more dubious would be if this is not compared with a theoretical
expectation of the distribution under equiprobable trees, but rather
another program's output.

Yes, I agree. There is a theoretical colless distribution for larger PDA trees (Airy, see http://www.inside-r.org/packages/cran/apTreeshape/docs/colless), and I don't think resolve_polytomies produces this, but I haven't done any proper testing.
 
Perhaps you should contact the apTreeshape authors?

That is a good suggestion.

Apologies again for taking up your time with this, when you are obviously busy with getting DendroPy 4 in a decent state.

Yours

Yan

Jeet Sukumaran

unread,
Dec 9, 2014, 6:22:15 PM12/9/14
to dendrop...@googlegroups.com
Actually, I rather quickly get trees with colless scores of 16 when I
run your code:

###################
import random
import dendropy
from dendropy.model import treeshape
from dendropy.calculate import treemeasure

taxa = dendropy.TaxonNamespace([str(x) for x in range(9)])
for i in range(100):
tree = treeshape.star_tree(taxa)
tree.resolve_polytomies(rng=random)
c = treemeasure.colless_tree_imbalance(tree, None)
if c == 16:
print(tree.as_string("newick"))
#################
(3,((6,4),(((2,(8,7)),1),(0,5))));
(7,((4,(6,2)),(3,((8,(0,5)),1))));
(5,(((8,4),3),(((2,(7,6)),0),1)));
(((((8,(7,2)),(1,0)),4),5),(6,3));
((0,1),((7,(((5,2),3),(8,6))),4));
(8,(((0,7),(6,(4,(1,2)))),(3,5)));
((7,6),(5,(((2,(8,0)),(4,1)),3)));
(4,((0,(2,1)),(6,((8,(3,5)),7))));
###################

Am I missing anything?
Could you check what revision of DendroPy you are running?

On 12/9/14, 6:10 PM, Yan Wong wrote:
> Yet I never get any trees with colless 16 from resolve_polytomies() (try
> upping "reps" to as much as you like)

Jeet Sukumaran

unread,
Dec 9, 2014, 6:25:37 PM12/9/14
to dendrop...@googlegroups.com
No apologies need. In fact, the opposite: we truly appreciate you taking
your time to explore these aspects of DendroPy and report results,
anomalies, or things that you think do not work *and* report back. Some
of these issues, e.g., long-run patterns, are very difficult to catch in
unit- and regression tests, and it takes detailed usage or
stress-testing to figure out.

On 12/9/14, 6:10 PM, Yan Wong wrote:
> Apologies again for taking up your time with this, when you are
> obviously busy with getting DendroPy 4 in a decent state.

Jeet Sukumaran

unread,
Dec 9, 2014, 6:37:35 PM12/9/14
to dendrop...@googlegroups.com
Note that rooting state is a very important aspect of trees to be
managed within DendroPy, and DendroPy assumes unrooted trees if not
otherwise indicated. Depending on the assumptions R and other programs
make, this may lead to differences in some statistics. I would
explicitly set ``tree.is_rooted=True`` or ``tree.is_rooted=False``
depending on what you want/need.

Yan Wong

unread,
Dec 10, 2014, 8:43:20 AM12/10/14
to dendrop...@googlegroups.com
On Tuesday, 9 December 2014 23:22:15 UTC, Jeet Sukumaran wrote:
Could you check what revision of DendroPy you are running?

Aha - I'v finally tracked down the problem. I was running the latex GitHub version, but I hadn't uninstalled my previous 3.x version. I'm not quite sure why, but this seemed to cause resolve_phylogenies() to become biased in this way. After uninstalling everything and reinstalling from GitHub, the V.4 code works as I expect it to, and I no longer get massively unbalanced, >1000 node-depth trees.

Apologies for all the diversions, but I'm glad I resolved a small bias in the v.3 code.

Cheers for all your help

Yan
Reply all
Reply to author
Forward
0 new messages