I'm trying to truncate the GTDB bacterial tree so that it retains only selected phyla and drops all other phyla (with all their descendants). In addition, I would like to truncate the retained phyla at the order level: keep every taxon down to order and remove everything below order.
I spent hours trying to get this to work, but failed. Here is what I have so far:
import dendropy
from biolib.newick import parse_label

PHYLA_TO_RETAIN = set(['p__Patescibacteria', 'p__Planctomycetota'])

tree = dendropy.Tree.get_from_path('bac120.tree',
                                   schema='newick',
                                   rooting='force-rooted',
                                   preserve_underscores=True)

taxa_in_tree = set()
for node in tree.postorder_node_iter():
    if not node.is_leaf():
        support, taxon, _auxiliary_info = parse_label(node.label)
        if taxon in PHYLA_TO_RETAIN:
            for leaf in node.leaf_iter():
                taxa_in_tree.add(leaf.taxon)
            PHYLA_TO_RETAIN.remove(taxon)
            if not PHYLA_TO_RETAIN:
                break

tree.retain_taxa(taxa_in_tree)

tree.write_to_path('tree.newick',
                   schema='newick',
                   suppress_rooting=True,
                   unquoted_underscores=True)
I can also provide the list of orders to retain/remove.
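
To make the intended result concrete, here is a minimal, library-independent sketch of the two-step truncation. It is not dendropy code: `Node`, `ranks_in_label` and `truncate` are hypothetical names introduced only for illustration. It also assumes that internal labels may combine a support value and several ranks on one node (for example `100.0:p__Patescibacteria; c__ABY1`), in which case an exact comparison of the whole label against a phylum name would not match.

```python
class Node:
    """Minimal stand-in for a tree node (hypothetical, not a dendropy class)."""

    def __init__(self, label=None, children=()):
        self.label = label
        self.children = list(children)


def ranks_in_label(label):
    """Map rank prefix -> taxon name, e.g. {'p': 'p__X', 'c': 'c__Y'}.

    Assumed label shapes: '100.0:p__X; c__Y', 'o__Z', '0.95', or None.
    """
    if not label:
        return {}
    # Drop optional auxiliary info after '|' and surrounding quotes.
    label = label.split('|', 1)[0].strip().strip("'")
    # Drop an optional leading 'support:' part.
    if ':' in label:
        label = label.split(':', 1)[1]
    ranks = {}
    for name in label.split(';'):
        name = name.strip()
        if len(name) > 3 and name[1:3] == '__':
            ranks[name[0]] = name
    return ranks


def truncate(node, phyla, inside=False):
    """Return a pruned copy of `node`, or None if nothing below it is kept."""
    ranks = ranks_in_label(node.label)
    inside = inside or ranks.get('p') in phyla
    if inside and 'o' in ranks:
        # Collapse the whole order into a single leaf named after the order.
        return Node(ranks['o'])
    if not node.children:
        # A leaf reached without passing an order label is kept as-is
        # when it lies inside a selected phylum, otherwise dropped.
        return Node(node.label) if inside else None
    kept = []
    for child in node.children:
        pruned_child = truncate(child, phyla, inside)
        if pruned_child is not None:
            kept.append(pruned_child)
    return Node(node.label, kept) if kept else None


# Toy tree: one selected phylum with two orders, one unselected phylum.
demo = Node('d__Bacteria', [
    Node('100.0:p__Patescibacteria; c__ABY1', [
        Node('o__A', [Node('g1'), Node('g2')]),
        Node('90.0:o__B', [Node('g3'), Node('g4')]),
    ]),
    Node('p__Other', [Node('o__C', [Node('g5'), Node('g6')])]),
])
pruned = truncate(demo, {'p__Patescibacteria'})
print([child.label for child in pruned.children[0].children])
```

The sketch leaves single-child internal nodes in place and uses recursion, so a very deep tree might need an iterative traversal instead. An order represented by a single genome may not carry its own internal `o__` label, which is why such leaves are kept unchanged here rather than renamed.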