Newick -> Phyloxml -> add Annotations

Fabian Schreiber

Aug 6, 2012, 11:13:25 AM
to bio-...@googlegroups.com
Hello!
For a project, I have a bunch of newick trees which I would like to convert to phyloxml and add annotations
to internal and terminal nodes.

I still haven't got my mind around using Bio::Phylo::Unparsers::Phyloxml to convert newick to phyloxml.

Once that is done, do you happen to have a code snippet to add annotations to a phyloxml tree?

Here is my code so far:

use Bio::Phylo::IO;
use Bio::Phylo::Treedrawer;
use Bio::Phylo::Forest::Tree;
use Bio::Phylo::Parsers::Phyloxml;
my $string = '((A,B),C);';
my $forest = Bio::Phylo::IO->parse(
                 -format => 'newick',
                 -string => $string   # -string, not -file: the newick is held in a scalar
              );
my $tree = $forest->first;

my $phyloxml_tree = $tree->to_string;


Thanks a lot!

Cheers!

Rutger Vos

Aug 7, 2012, 2:10:20 AM
to bio-...@googlegroups.com
Hi Fabian,

I realize that how this is done is not (ahem) immediately obvious, so
I wrote a blog post with an example. The post is here:
http://biophylo.blogspot.com/2012/08/attaching-phyloxml-annotations.html

The example in the post is the following:

#!/usr/bin/perl
use strict;
use warnings;
use Bio::Phylo::Factory;
use Bio::Phylo::IO qw'parse unparse';
use Bio::Phylo::Util::CONSTANT qw':objecttypes :namespaces';

# let's say we want to attach species codes such
# as the ones that archaeopteryx uses for gene tree /
# species tree reconciliation. This means that every
# tip in the tree needs to link to an OTU to which we
# attach the code.
my %codes = (
    Homo_sapiens    => 'HS',
    Pan_troglodytes => 'PT',
    Gorilla_gorilla => 'GG',
);

# we parse the newick as a project so that we end
# up with an object that has both the tree and the
# annotated OTUs
my $proj = parse(
    '-format'     => 'newick',
    '-handle'     => \*DATA,
    '-as_project' => 1,
);

# here we make the OTUs
my ($forest) = @{ $proj->get_items(_FOREST_) };
my $taxa = $forest->make_taxa;
$proj->insert($taxa);

# it's easier to make a factory object for creating the annotations
my $fac = Bio::Phylo::Factory->new;

# here we annotate the OTUs
$taxa->visit(sub{
    my $taxon = shift;
    my $name  = $taxon->get_name;
    my $code  = $codes{$name};
    $taxon->add_meta(
        $fac->create_meta(
            '-namespaces' => { 'pxml' => _NS_PHYLOXML_ },
            '-triple'     => { 'pxml:code' => $code },
        )
    );
});

# now write the output
print unparse( '-format' => 'phyloxml', '-phylo' => $proj );

__DATA__
((Homo_sapiens,Pan_troglodytes),Gorilla_gorilla);



--
Dr. Rutger A. Vos
Bioinformaticist
NCB Naturalis
Visiting address: Office A109, Einsteinweg 2, 2333 CC, Leiden, the Netherlands
Mailing address: Postbus 9517, 2300 RA, Leiden, the Netherlands
http://rutgervos.blogspot.com

Rutger Vos

Aug 7, 2012, 7:22:04 AM
to Fabian Schreiber, bio-...@googlegroups.com
> I am planning to use Archaeopteryx to display phylogenetic trees in the next
> TreeFam release.

Cool! As it happens I'm now sitting next to Christian Zmasek (in
Moscow, of all places). I assume he's aware of your plans? Small world
:-)

> I hope it is ok, to ask you two more questions. I am totally stuck and have
> no clue how to solve it. Your help is much appreciated!

Absolutely, very welcome.

> Using my $taxa = $forest->make_taxa you create a taxon block. This allows
> you to iterate over
> all terminal taxa using $taxa->visit.
>
> Question: How can you do the same with internal nodes?

There are a number of different ways. An easy way is with $tree->visit.
Both trees and taxa inherit from Bio::Phylo::Listable, so both can be
used in the same way. Also, nodes have a method $node->is_internal
(and $node->is_terminal, $node->is_root) so you should be able to hack
something together with that :-)
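
A minimal sketch of that $tree->visit approach, assuming the toy newick string from the first message. The labels n1, n2, ... are made up for illustration, and the sketch counts the root as an internal node on the assumption that is_internal is true for any node with children:

```perl
use strict;
use warnings;
use Bio::Phylo::IO qw'parse';

# toy tree; in practice this would be your own newick
my $tree = parse(
    '-format' => 'newick',
    '-string' => '((A,B),C);',
)->first;

# visit() walks all nodes as a flat list, so filter on is_internal
my @internal;
$tree->visit(sub {
    my $node = shift;
    return unless $node->is_internal;

    # hypothetical labels n1, n2, ... so the nodes can be told apart
    $node->set_name( 'n' . ( @internal + 1 ) );
    push @internal, $node;
});

print scalar(@internal), " internal nodes: ",
    join( ',', map { $_->get_name } @internal ), "\n";
```

Inside the same callback you could call $node->add_meta(...) instead of set_name to attach annotations to the internal nodes only.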

A downside of doing it that way is that it treats the tree simply as a
flat list of nodes, so if you want to do something topologically
clever (like compute paths) you might be interested in:
$tree->visit_depth_first, which takes named arguments whose values are
subroutine references, i.e.:

$tree->visit_depth_first(
    '-pre'  => sub { print "executed in pre-order traversal, i.e. parent before child" },
    '-post' => sub { print "executed in post-order traversal, i.e. child before parent" },
);
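
As an illustration of something "topologically clever", here is a rough sketch that computes each node's depth (number of edges to the root) in a pre-order pass. It assumes the callbacks receive the current node as their first argument and that the traversal starts at the root; the %depth hash is just a helper for this example:

```perl
use strict;
use warnings;
use Bio::Phylo::IO qw'parse';

my $tree = parse(
    '-format' => 'newick',
    '-string' => '((A,B),C);',
)->first;

# depth of each node, keyed on the node's ID
my %depth;
$tree->visit_depth_first(
    # pre-order: the parent has already been processed when we reach a child
    '-pre' => sub {
        my $node   = shift;
        my $parent = $node->get_parent;
        $depth{ $node->get_id } =
            defined $parent ? $depth{ $parent->get_id } + 1 : 0;
    },
);

# report the depth of every tip
$tree->visit(sub {
    my $node = shift;
    return unless $node->is_terminal;
    print $node->get_name, "\t", $depth{ $node->get_id }, "\n";
});
```

This only works in pre-order: in a post-order pass the parent's depth would not yet be known when the child is visited.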

> 2. I would like to display domain architecture information.
> The corresponding xml looks like this and works in Archaeopteryx.
> <sequence>
> <domain_architecture length="1249">
> <domain from="1" to="245" confidence="7.0E-26">COX1</domain>
> <domain from="1168" to="1204" confidence="0.3">piwi</domain>
> </domain_architecture>
> </sequence>

This can be done, easily but in a very ugly way :-)

I never got around to doing the domain architecture annotations
properly, but you can do something like this, i.e. by inserting XML
strings:

my $tree = $forest->first;
$tree->visit(sub{
    my $node = shift;
    my $arch = _create_dummy_architecture();
    $node->add_meta(
        $fac->create_meta(
            '-namespaces' => { 'pxml' => _NS_PHYLOXML_ },
            '-triple'     => { 'pxml:sequence' => $arch },
        )
    );
});

sub _create_dummy_architecture {
return
'<domain_architecture length="1249">
<domain from="1" to="245" confidence="7.0E-26">COX1</domain>
<domain from="1168" to="1204" confidence="0.3">piwi</domain>
</domain_architecture>';
}
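
To go beyond the dummy string, the XML fragment could be assembled from real per-sequence data. A sketch in plain Perl follows; build_architecture, xml_escape and the layout of the domain hashes are hypothetical helpers for this example, not part of Bio::Phylo:

```perl
use strict;
use warnings;

# escape the characters that are special in XML text and attribute values
sub xml_escape {
    my $text = shift;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# hypothetical helper: total sequence length plus a list of domain hashes
sub build_architecture {
    my ( $length, @domains ) = @_;
    my $xml = sprintf qq{<domain_architecture length="%d">\n}, $length;
    for my $d (@domains) {
        $xml .= sprintf
            qq{<domain from="%d" to="%d" confidence="%s">%s</domain>\n},
            $d->{from}, $d->{to},
            xml_escape( $d->{confidence} ), xml_escape( $d->{name} );
    }
    return $xml . '</domain_architecture>';
}

my $arch = build_architecture(
    1249,
    { name => 'COX1', from => 1,    to => 245,  confidence => '7.0E-26' },
    { name => 'piwi', from => 1168, to => 1204, confidence => '0.3' },
);
print $arch, "\n";
```

The returned string can then be passed as the value of the 'pxml:sequence' triple in place of _create_dummy_architecture().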

The result looks like this, which I think is roughly what you are after:

<?xml version="1.0" encoding="UTF-8"?>
<phyloxml xmlns="http://www.phyloxml.org"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://www.phyloxml.org http://www.phyloxml.org/1.10/phyloxml.xsd">
  <phylogeny rooted="true">
    <clade>
      <sequence>
        <domain_architecture length="1249">
          <domain confidence="7.0E-26" from="1" to="245">COX1</domain>
          <domain confidence="0.3" from="1168" to="1204">piwi</domain>
        </domain_architecture>
      </sequence>
      <clade>
        <sequence>
          <domain_architecture length="1249">
            <domain confidence="7.0E-26" from="1" to="245">COX1</domain>
            <domain confidence="0.3" from="1168" to="1204">piwi</domain>
          </domain_architecture>
        </sequence>
        <clade>
          <name>Homo_sapiens</name>
          <sequence>
            <domain_architecture length="1249">
              <domain confidence="7.0E-26" from="1" to="245">COX1</domain>
              <domain confidence="0.3" from="1168" to="1204">piwi</domain>
            </domain_architecture>
          </sequence>
          <taxonomy>
            <code>HS</code>
          </taxonomy>
        </clade>
        <clade>
          <name>Pan_troglodytes</name>
          <sequence>
            <domain_architecture length="1249">
              <domain confidence="7.0E-26" from="1" to="245">COX1</domain>
              <domain confidence="0.3" from="1168" to="1204">piwi</domain>
            </domain_architecture>
          </sequence>
          <taxonomy>
            <code>PT</code>
          </taxonomy>
        </clade>
      </clade>
      <clade>
        <name>Gorilla_gorilla</name>
        <sequence>
          <domain_architecture length="1249">
            <domain confidence="7.0E-26" from="1" to="245">COX1</domain>
            <domain confidence="0.3" from="1168" to="1204">piwi</domain>
          </domain_architecture>
        </sequence>
        <taxonomy>
          <code>GG</code>
        </taxonomy>
      </clade>
    </clade>
  </phylogeny>
</phyloxml>

Best wishes,

Rutger

Fabian Schreiber

Aug 7, 2012, 6:45:58 AM
to bio-...@googlegroups.com, Rutger Vos
Thanks, Rutger!
Very helpful and inspiring blog post.

The code you provided works like a charm for me :-)
 
I am planning to use Archaeopteryx to display phylogenetic trees in the next TreeFam release.
I hope it is OK to ask you two more questions. I am totally stuck and have no clue how to solve them. Your help is much appreciated!

1. Besides adding annotations to terminal nodes, I am interested in adding speciation/duplication information to inner nodes.

The corresponding xml part might look like this:
<clade>
  <name>Euarchontoglires</name>
  <events>
    <speciations>1</speciations>
  </events>
  ...
Using my $taxa = $forest->make_taxa you create a taxon block. This allows you to iterate over
all terminal taxa using $taxa->visit.

Question: How can you do the same with internal nodes?

2. I would like to display domain architecture information.
The corresponding xml looks like this and works in Archaeopteryx.
<sequence>
<domain_architecture length="1249">
<domain from="1" to="245" confidence="7.0E-26">COX1</domain>
<domain from="1168" to="1204" confidence="0.3">piwi</domain>
</domain_architecture>
</sequence>

I am having problems setting the attributes for e.g. "domain_architecture".
Furthermore, using the code you kindly provided, every meta-information seems
to be embedded into a <taxonomy>-block. But the domain information should
be in its own <sequence>-block.

My xml looks like this:
<clade>
  <name>Gorilla_gorilla</name>
  <taxonomy>
    <code>GG</code>
    <domain_architecture>
      <domain>Piwi</domain>
    </domain_architecture>
  </taxonomy>
</clade>

The code is here:

$taxon->add_meta(
    $fac->create_meta(
        '-namespaces' => { 'pxml' => 'http://www.phyloxml.org/1.10/terms#' },
        '-triple'     => { 'pxml:code' => $code },
    )
);
## Add Taxon image
$taxon->add_meta(
    $fac->create_meta(
        '-namespaces' => { 'pxml' => 'http://www.phyloxml.org/1.10/terms#' },
        '-triple'     => { 'pxml:uri' => 'http://www.primates.com/homo/homo-sapiens.jpg' },
    ),
);
## Add domain architecture
my $domain_architecture = $fac->create_meta(
    '-namespaces' => { 'pxml' => 'http://www.phyloxml.org/1.10/terms#' },
    '-triple'     => { 'pxml:domain' => 'Piwi', 'pxml:length' => '324' },
);
$taxon->add_meta(
    $fac->create_meta(
        '-namespaces' => { 'pxml' => 'http://www.phyloxml.org/1.10/terms#' },
        '-triple'     => {
            'pxml:domain_architecture' => $domain_architecture
        }
    )
);

Do you have an idea on how to implement the two scenarios?

Thanks a lot in advance!

Fabian

--
Dr. Fabian Schreiber
TreeFam Project Leader
The Wellcome Trust Sanger Institute

t: 01223 494726

Li Yang

Apr 13, 2015, 10:11:58 PM
to bio-...@googlegroups.com, Fab.Sc...@gmail.com
I want to parse a phyloxml file, get the taxonomy code (or the scientific_name in the sequence block), and replace each node name with it. Is there an example that covers this? Thanks so much!

On Tuesday, August 7, 2012 at 7:22:04 PM UTC+8, Rutger Vos wrote:

Rutger Vos

Apr 16, 2015, 9:17:43 AM
to bio-...@googlegroups.com, Fabian Schreiber
Hi Li,

Sorry about the slow response! If you still need this, it should be pretty easy to do. Do you have an example file so I can write a little script that demonstrates this?

I do have to point out that phyloxml in principle uses the scientific_name as a way to indicate that multiple sequences (and therefore, multiple tips in the same tree) belong to the same species. The intent in phyloxml is that you would use this, for example, to look for gene duplications. The outcome of copying these scientific_name fields to tips in a tree could therefore be that multiple tips end up with the same name. Bio::Phylo doesn't care about this (the names are just a label), but other programs definitely do care if the same label occurs twice. Is that something that can happen with your data? Is it a problem?
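
Whether duplicate labels would arise can be checked before renaming anything. A rough sketch in plain Perl, where the %scientific_name mapping (tip label to scientific_name) is made-up example data standing in for whatever you extract from your own file:

```perl
use strict;
use warnings;

# hypothetical mapping of tip label => scientific_name, as it might be
# extracted from a phyloxml file that contains two human paralogs
my %scientific_name = (
    'HS_gene1' => 'Homo sapiens',
    'HS_gene2' => 'Homo sapiens',
    'PT_gene1' => 'Pan troglodytes',
);

# group the tips by the label they would receive
my %tips_for;
push @{ $tips_for{ $scientific_name{$_} } }, $_
    for sort keys %scientific_name;

# any name shared by more than one tip would become a duplicate label
my @duplicated = grep { @{ $tips_for{$_} } > 1 } sort keys %tips_for;
for my $name (@duplicated) {
    print "$name would label: @{ $tips_for{$name} }\n";
}
```

If @duplicated is not empty, you could append a suffix or keep the original tip label alongside the species name so that each label stays unique.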

Best wishes,

Rutger
