Fast count of number of nodes

已查看 17 次
跳至第一个未读帖子

Yan Wong

未读,
2018年12月28日 19:20:542018/12/28
收件人 DendroPy Users
I have a large tree (2.2 million tips) read in from a newick file. Iterating over nodes takes a while. I'd like to get a rough estimate of the total number of nodes in the tree so I can give a progress bar when iterating over nodes. Rather than do `n_nodes = sum(1 for _ in tree.preorder_node_iter())` which requires iteration, I thought I could simply do `n_nodes = len(tree.bipartition_encoding)`. Is there any reason why this would be a bad idea?

Alexey Morozov

未读,
2018年12月28日 20:30:132018/12/28
收件人 dendrop...@googlegroups.com
If you know the number of leaves, a number of internal nodes is trivial to calculate. See http://courses.cs.vt.edu/~cs3114/Fall09/wmcquain/Notes/T03a.BinaryTreeTheorems.pdf for the formulae.

сб, 29 дек. 2018 г., 8:20 'Yan Wong' via DendroPy Users dendrop...@googlegroups.com:
I have a large tree (2.2 million tips) read in from a newick file. Iterating over nodes takes a while. I'd like to get a rough estimate of the total number of nodes in the tree so I can give a progress bar when iterating over nodes. Rather than do `n_nodes = sum(1 for _ in tree.preorder_node_iter())` which requires iteration, I thought I could simply do `n_nodes = len(tree.bipartition_encoding)`. Is there any reason why this would be a bad idea?

--
You received this message because you are subscribed to the Google Groups "DendroPy Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dendropy-user...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Yan Wong

未读,
2018年12月29日 19:34:102018/12/29
收件人 DendroPy Users


On Saturday, 29 December 2018 09:30:13 UTC+8, Alexey Morozov wrote:
If you know the number of leaves, a number of internal nodes is trivial to calculate. See http://courses.cs.vt.edu/~cs3114/Fall09/wmcquain/Notes/T03a.BinaryTreeTheorems.pdf for the formulae.

Only for a binary tree, surely. Also, I see that bipartitions are not immediately calculated after tree reading, so the bipartition method won't work for me either. Hmm

Jeet Sukumaran

未读,
2018年12月29日 20:54:502018/12/29
收件人 DendroPy Users
As Alexey Morozov notes, you can indeed get the count of nodes through various standard formulae, depending on whether the tree is rooted or not, whether you want internal nodes only or not etc.

You are also absolutely correct, of course, that these would not apply if the tree might have nodes with outdegree > 2.

In this case, you will indeed have to somehow pretraverse the tree to get the node count.

This is, in fact, what is implicitly done by encoding the bipartitions which, while not done automatically, can be requested:

```
n_nodes = len(tree.encode_bipartitions())
```

though for your purposes, if you are not actually using the bipartitions, you will find it faster to simply build a list from the iterator instead:

```
nodes = list(tree.preorder_node_iter())
n_nodes = len(nodes)
```

The most efficient way (perhaps only marginally faster than the list approach) might be to count nodes as the tree is being built from the source by passing in a node counting function to the tree reading method using the ``finish_node_fn`` argument. This is shown by the ``f3()`` approach below. 

```
#! /usr/bin/env python
# -*- coding: utf-8 -*-

import timeit
import dendropy

# target_data_path = "pythonidae.mle.newick"
target_data_path = "Smith_2001_angiosperms.newick"
target_data_schema = "newick"

def f1():
    tree = dendropy.Tree.get(
            path=target_data_path,
            schema=target_data_schema)
    print(len(tree.encode_bipartitions()))

def f2():
    tree = dendropy.Tree.get(
            path=target_data_path,
            schema=target_data_schema)
    print(len(list(tree.preorder_node_iter())))

def f3():
    node_data = {"count": 0}
    def count_nodes(x):
        node_data["count"] += 1
    tree = dendropy.Tree.get(
            path=target_data_path,
            schema=target_data_schema,
            finish_node_fn=count_nodes)
    n_nodes = node_data["count"]
    print(n_nodes)

def measure(fn, label, timeit_number=1, timeit_repeat=1):
    t1 = timeit.repeat(
            fn,
            repeat=timeit_repeat,
            number=timeit_number)
    print("'{}': {}".format(label, t1))


measure(f1, "encode_bipartitions")
measure(f2, "list(tree.preorder_node_iter())")
measure(f3, "finish_node_fn()")

```
回复全部
回复作者
转发
0 个新帖子