Simplifying and extracting large trees


Yan Wong

Jun 21, 2014, 10:34:40 AM
to bio-...@googlegroups.com
Hi,

I'm thinking of using Bio::Phylo to mess around with the Open Tree of Life (~2.5 million nodes, from http://files.opentreeoflife.org/trees/). In particular, I want to extract subtrees and, below a certain level, fold multiple branches from nodes into a single, summary branch. Is there a Bio::Phylo function I can use that, given a node, collapses the whole tree from that point down into a single branch? And will Bio::Phylo cope with a tree on this scale? If not, any suggestions for other Newick parsing programs?

Cheers

Yan

Rutger Vos

Jun 21, 2014, 4:40:38 PM
to bio-...@googlegroups.com
Hi Yan,

you will most likely not be able to load a tree of that size in RAM, not in Bio::Phylo and not in general in most phylogenetic programming toolkits in scripting languages (Perl/Python/R/Ruby/etc.). 

That said, what you can do is load the tree into a simple relational database (e.g. SQLite) and access that. I wrote an extension to Bio::Phylo that works in that way, so you can have the same API but it operates on database records instead of a tree in memory. This will be somewhat slower, but much more scalable. I'm pretty optimistic it would work for the OTOL tree. It does for the big greengenes tree, for example. 

Here's the code: https://github.com/rvosa/bio-phylo-megatree - it is a bit experimental so if you need any help please ask. The basic workflow is that you use the make_megatree script:

$ make_megatree -infile <otol.newick> -dbfile <otol.db>

Where <otol.newick> is the OTOL tree file in newick format, and <otol.db> is the output file as a SQLite db. Subsequently, in your code, you can do:

use Megatree;
my $tree = Megatree->connect("otol.db");

The $tree will have all the methods of a normal Bio::Phylo tree, but it will do the traversals by lookups in the database. As database lookups are slower than pulling something out of RAM you will want to be efficient in the number of traversals you do.

For your second question: I'm not entirely sure what you're trying to accomplish in the end. You're able to mark a node as "collapsed", but this only means something for the tree drawer module (which will draw the node and all its descendants as a triangle). If this is not about visualization but about reporting something at certain nodes, you can make this work with the traversal methods, most likely, though I need a bit more information.

Hope this helps,

Rutger



Yan Wong

Jun 26, 2014, 7:14:05 AM
to bio-...@googlegroups.com
On Saturday, 21 June 2014 21:40:38 UTC+1, Rutger Vos wrote:
> Hi Yan,
>
> you will most likely not be able to load a tree of that size in RAM, not in Bio::Phylo and not in general in most phylogenetic programming toolkits in scripting languages (Perl/Python/R/Ruby/etc.).

Thanks, I was afraid that might be the case. I've just hacked a Perl script together that incrementally reads the tree in reverse, splitting on the closing-parenthesis character, so it doesn't need to hold the whole thing in memory. When it finds the required node it starts saving the subtree, then prints the subtree out when it reaches the end of the nested parentheses.
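
The reverse-reading idea can be sketched roughly as follows. This is not the script mentioned above, just a minimal illustration using core Perl only. The function name extract_subtree and the block size are made up for the example, and it assumes plain Newick with unquoted labels and no bracketed comments. It reads the file backwards in fixed-size blocks; because an internal node's label follows its closing parenthesis, the label is seen first when reading in reverse, and the subtree is then collected until the parentheses balance.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET SEEK_END);

# Hypothetical helper: return the subtree rooted at the node named $target,
# reading the Newick file behind $fh backwards in blocks of $block_size bytes.
# Assumes unquoted labels and no [comments]. Returns undef if not found.
sub extract_subtree {
    my ( $fh, $target, $block_size ) = @_;
    $block_size ||= 65536;
    seek( $fh, 0, SEEK_END ) or die "seek failed: $!";
    my $pos    = tell($fh);
    my $token  = '';    # label (and branch length) currently being read
    my @pieces = ();    # captured subtree fragments, in reverse order
    my $depth  = 0;     # parenthesis nesting inside the captured subtree
    while ( $pos > 0 ) {
        my $len = $pos < $block_size ? $pos : $block_size;
        $pos -= $len;
        seek( $fh, $pos, SEEK_SET ) or die "seek failed: $!";
        ( read( $fh, my $block, $len ) // 0 ) == $len or die "short read";
        for my $ch ( reverse split //, $block ) {
            next if $ch =~ /\s/;
            if ( $ch !~ /[(),;]/ ) {    # part of a label: prepend it
                $token = $ch . $token;
                next;
            }
            ( my $name = $token ) =~ s/:.*//s;    # drop any branch length
            if ( !@pieces ) {
                if ( $name eq $target ) {
                    return "$token;" if $ch ne ')';    # target is a leaf
                    push @pieces, ")$token";           # start capturing
                    $depth = 1;
                }
            }
            else {
                push @pieces, "$ch$token";
                $depth++ if $ch eq ')';
                $depth-- if $ch eq '(';
                return join( '', reverse @pieces ) . ';' if $depth == 0;
            }
            $token = '';
        }
    }
    return;
}

# Small demonstration on an in-memory "file"
my $demo_newick = "((A:1,(B:1,C:1)Z:2)X:3,D:4)R;\n";
open my $demo_fh, '<', \$demo_newick or die "cannot open: $!";
print extract_subtree( $demo_fh, 'X', 8 ), "\n";    # prints (A:1,(B:1,C:1)Z:2)X:3;
```

Only the subtree being captured is held in memory, so the cost depends on the size of the extracted clade rather than on the size of the whole tree.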
 
> That said, what you can do is load the tree into a simple relational database (e.g. SQLite) and access that.

That's very useful to know, thanks.
 
> For your second question: I'm not entirely sure what you're trying to accomplish in the end.

It's just that with huge trees, it's helpful not to work with the leaf nodes but to replace them with an exemplar. So, for instance, replace the 400-odd Senecio species with a single leaf that represents the entire genus. But I've sorted something out with my little Perl script to do this anyway.
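
As an illustration of that kind of collapsing (again, not the script mentioned above), here is a minimal core-Perl sketch. The function name collapse_clades is hypothetical, and it assumes the clades to collapse are labelled internal nodes in plain Newick with unquoted labels. It works on a string already in memory, so it is meant for an extracted subtree rather than the full tree. Each named clade "(...)Name" is replaced by the single leaf "Name", keeping any branch length attached to the label.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every clade whose internal node is named in
# @names with a single leaf carrying that name. Assumes unquoted labels.
sub collapse_clades {
    my ( $newick, @names ) = @_;
    my %collapse = map { $_ => 1 } @names;
    my @stack    = ('');    # one output buffer per open parenthesis
    my @tokens   = grep { length } split /([(),;])/, $newick;
    while (@tokens) {
        my $tok = shift @tokens;
        if ( $tok eq '(' ) {
            push @stack, '';
        }
        elsif ( $tok eq ')' ) {
            my $inner = pop @stack;
            die "unbalanced parentheses" unless @stack;
            # The node label, if any, is the token right after the ')'
            my $label = ( @tokens && $tokens[0] !~ /^[(),;]$/ ) ? shift @tokens : '';
            ( my $name = $label ) =~ s/:.*//s;    # ignore branch length
            $stack[-1] .= $collapse{$name} ? $label : "($inner)$label";
        }
        else {
            $stack[-1] .= $tok;
        }
    }
    die "unbalanced parentheses" unless @stack == 1;
    return $stack[0];
}

my $family = '((Senecio_vulgaris,Senecio_jacobaea)Senecio,(Bellis_perennis)Bellis)Asteraceae;';
print collapse_clades( $family, 'Senecio' ), "\n";
# prints (Senecio,(Bellis_perennis)Bellis)Asteraceae;
```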
 
> Hope this helps,

Thanks for the reply. By the way, when I unparse a Bio::Phylo tree in Newick format, it loses any name given to the root node. Should there be an option to output it?

Yan

Rutger Vos

Jun 26, 2014, 8:13:33 AM
to bio-...@googlegroups.com
 
> > Hope this helps,
>
> Thanks for the reply. By the way, when I unparse a Bio::Phylo tree in Newick format, it loses any name given to the root node. Should there be an option to output it?

Yes! $tree->to_newick( -nodelabels => 1 )

Yan Wong

Jun 26, 2014, 8:23:52 AM
to bio-...@googlegroups.com
Fantastic. Thanks. I couldn't find it in the documentation, but perhaps I didn't look hard enough.