How to calculate tree distances without using a lot of memory?


Adam Bazinet

Mar 6, 2012, 4:11:27 PM
to dendrop...@googlegroups.com
Here is a simple script to calculate the symmetric difference distance between one tree (best_tree) and each tree in a set of other trees (all_trees):

#!/usr/bin/python

import dendropy

best_tree = dendropy.Tree.get_from_path("best_tree.tre", schema="nexus")
all_trees = dendropy.TreeList.get_from_path("all_trees", schema="nexus")

output_f = open('treedist_output', 'w')
counter = 1

for tree in all_trees:
    distance = best_tree.symmetric_difference(tree)
    output_f.write("1 " + str(counter) + " " + str(distance) + "\n")
    counter += 1

output_f.close()

In one example, all_trees contained 1000 trees, each with 300+ taxa. In that case the script took ~5 minutes to run and used 2.3 GB of memory. The memory use is what concerns me most; is there a more efficient way to do this with DendroPy?

thanks,
Adam


Jeet Sukumaran

Mar 7, 2012, 1:19:02 PM
to dendrop...@googlegroups.com
Hi Adam,

You could use a tree iterator to only keep one tree in memory at a time instead of reading the entire list of trees at once.

This is discussed here:

    http://packages.python.org/DendroPy/tutorial/trees.html#efficiently-iterating-over-trees-in-a-file

Note that for correct results, you want to be sure to use the same TaxonSet object to manage the taxa in your best tree as well as the other trees. So, for example:

#!/usr/bin/python

import dendropy

best_tree = dendropy.Tree.get_from_path("best_tree.tre", schema="nexus")

output_f = open('treedist_output', 'w')
counter = 1

for tree in tree_source_iter(
        stream=open('all_trees', 'rU'),
        schema='nexus',
        taxon_set=best_tree.taxon_set):

    distance = best_tree.symmetric_difference(tree)
    output_f.write("1 " + str(counter) + " " + str(distance) + "\n")
    counter += 1

output_f.close()
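
For anyone who wants to see the idea without DendroPy installed, here is a rough, library-free sketch of the same pattern: read one tree at a time and compare its splits against a fixed reference tree. It is only an illustration, not how DendroPy implements symmetric_difference. The helper names (parse_clusters, bipartitions, stream_distances) are made up for this example, and it assumes one plain Newick tree per line rather than a NEXUS file. Plain string labels stand in for the shared TaxonSet here: both trees have to refer to the same taxa in the same way for the comparison to mean anything.

```python
import io


def parse_clusters(newick):
    """Return (all_leaf_labels, clusters) for one Newick string.

    Each cluster is the frozenset of leaf labels below one internal node.
    Assumes simple labels; branch lengths and internal-node labels are ignored.
    """
    stack = [set()]          # stack[0] ends up holding every leaf label
    clusters = []
    token = ""
    after_close = False      # True right after ")": next token is not a leaf
    for ch in newick.strip().rstrip(";"):
        if ch in "(,)":
            label = token.split(":")[0].strip()
            token = ""
            if label and not after_close:
                stack[-1].add(label)
            if ch == "(":
                stack.append(set())
                after_close = False
            elif ch == ",":
                after_close = False
            else:
                done = stack.pop()
                clusters.append(frozenset(done))
                stack[-1] |= done
                after_close = True
        else:
            token += ch
    return frozenset(stack[0]), clusters


def bipartitions(newick):
    """Non-trivial splits of the tree, treated as unrooted.

    Each split is stored as the side that does NOT contain the
    alphabetically first leaf, so equal splits compare equal.
    """
    leaves, clusters = parse_clusters(newick)
    anchor = min(leaves)
    splits = set()
    for cluster in clusters:
        side = leaves - cluster if anchor in cluster else cluster
        if 2 <= len(side) <= len(leaves) - 2:
            splits.add(side)
    return splits


def stream_distances(reference_newick, stream):
    """Yield the symmetric difference to the reference, one tree at a time."""
    reference = bipartitions(reference_newick)
    for line in stream:
        line = line.strip()
        if line:
            # Only the current tree's splits are alive at this point.
            yield len(reference ^ bipartitions(line))


# Hypothetical stand-in for a file of trees, one Newick string per line.
all_trees = io.StringIO("((A,B),(C,D));\n((A,C),(B,D));\n(A,(B,(C,D)));\n")
for counter, distance in enumerate(stream_distances("((A,B),(C,D));", all_trees), 1):
    print("1 %d %d" % (counter, distance))
```

Because stream_distances is a generator, each tree's splits can be discarded as soon as its distance has been written out, which is the same reason the iterator version above needs so much less memory than loading a whole TreeList.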

Jeet Sukumaran

Mar 7, 2012, 1:20:33 PM
to DendroPy Users
P.S. Bug fix: you will need to add:

from dendropy import tree_source_iter

or otherwise qualify 'tree_source_iter' with the dendropy namespace.


Adam Bazinet

Mar 8, 2012, 11:59:32 AM
to dendrop...@googlegroups.com
It works like a charm -- only uses ~45M of memory now on that example, and runs faster too (< 2 minutes).

Thanks Jeet.

-adam