Placing sequences on the Silva tree

sinclair

Oct 30, 2012, 12:33:16 PM
to pplace...@googlegroups.com
Hello,

pplacer seems like a convenient and fast software package for placing new SSU or LSU sequences on a tree. I would like to use it to place sequences on the latest release of the SILVA tree (http://www.arb-silva.de/no_cache/download/archive/current/ARB_files/). SILVA provides ARB files, and it is not too difficult to produce a Newick tree and a FASTA alignment from them. From reading http://fhcrc.github.com/taxtastic/quickstart.html, I understand that these are what you need to create a reference package. However, one piece of information is missing: "tree_stats.txt". How would one go about getting that from an ARB file? And do you think what I am trying to achieve is a good use of your software?

Erick Matsen

Oct 30, 2012, 8:08:54 PM
to pplace...@googlegroups.com
Hello there Lucas--


Thank you for your interest in pplacer.

The statistics file is something that you get from building a ML
phylogenetic tree.

However, I do not think that this is a good use of pplacer, or of
phylogenetics as it is currently practised. The primary reason has to
do with the very large percentage of gaps in the SILVA alignment. Have
you looked at it? Because of this, the phylogenetic structure of the
tree (and the locations of the placements) are determined by the
pattern of gaps in the alignment, which were determined by the
alignment heuristic.
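
For anyone who wants to check the gap percentage on their own export, here is a minimal sketch in Python. It assumes the alignment has been written out as FASTA and that both "-" and "." mark gaps (an assumption about the export, so adjust gap_chars if yours differs). The function names are made up for illustration and are not part of pplacer or ARB.

```python
def read_fasta(lines):
    """Parse FASTA text lines into a dict of header -> aligned sequence."""
    seqs = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            seqs[name] = []
        elif name is not None:
            seqs[name].append(line)
    return {n: "".join(parts) for n, parts in seqs.items()}


def gap_fraction(seqs, gap_chars="-."):
    """Fraction of all alignment cells that are gap characters."""
    total = sum(len(s) for s in seqs.values())
    if total == 0:
        return 0.0
    gaps = sum(s.count(c) for s in seqs.values() for c in gap_chars)
    return gaps / total


# Tiny made-up alignment, only to show the call pattern.
example = """>a
AC--GT..
>b
A---GTTA
""".splitlines()
alignment = read_fasta(example)
print(gap_fraction(alignment))
```

For a real file, pass an open file handle to read_fasta instead of the example lines.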

I can also say that pplacer's likelihood placements will use a lot of
memory, and not be very fast, for a tree of that size.

Let me know if that doesn't make sense.


Erick
--
Frederick "Erick" Matsen, Assistant Member
Fred Hutchinson Cancer Research Center
http://matsen.fhcrc.org/

Morgan Langille

Oct 30, 2012, 8:30:21 PM
to pplace...@googlegroups.com
Hi Erick,

I am curious about this application of pplacer as well. I don't know
SILVA very well, but I have used pplacer for assigning 16S reads to
the Greengenes tree. In that case, I had to rebuild the tree using
RAxML or FastTree so I could get the stats file. I am a bit surprised
you don't think it is a good application of pplacer. The alignment is
gappy, but the gappy columns are usually all removed before tree
building by using the lane mask.
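
As a rough illustration of what applying a lane mask means, here is a minimal Python sketch. It assumes the mask is a string of "0" and "1" characters with one character per alignment column, where "1" means keep the column. The function name and the toy data are hypothetical.

```python
def apply_mask(seqs, mask):
    """Keep only the alignment columns where the mask character is '1'."""
    keep = [i for i, m in enumerate(mask) if m == "1"]
    masked = {}
    for name, seq in seqs.items():
        if len(seq) != len(mask):
            raise ValueError(f"{name}: length {len(seq)} != mask length {len(mask)}")
        masked[name] = "".join(seq[i] for i in keep)
    return masked


# Toy alignment and mask, for illustration only.
toy = {"a": "AC--GT", "b": "A-T-GT"}
masked = apply_mask(toy, "110011")
print(masked)
```

Reads and reference sequences have to be masked with the same mask so that their columns still line up afterwards.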

Also, a paper that uses pplacer to assign reads to the Greengenes
tree was just published:
http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1002743

Of course, if there is some important distinction between SILVA and
Greengenes, then excuse my ignorance.

Sincerely,

Morgan Langille
--
Morgan Langille
http://morganlangille.com

Erick Matsen

Oct 30, 2012, 9:13:35 PM
to pplace...@googlegroups.com
Hello Morgan--


Thanks for your email-- I'm glad to be having this discussion.

It all depends on what is going to happen downstream with the
placements. For example, I don't think that using a comprehensive
database like GG or SILVA is going to give good results for
high-resolution taxonomic classification. However, for Steve K's
application I think that uncertainty in placement is pretty
acceptable, given that the main source of error (in my mind) is the
model used for 16S copy number change in the tree.

I may be remembering wrong, but I think the last time I looked at
SILVA it was 90% gaps. Even if a lane mask is applied, it still means
that the location of the informative positions is determined by the
alignment algorithm, which as I described can then have a very strong
impact on the eventual placement.

What do you think? Have you chatted about this with Rob?


Erick

Conor Meehan

Oct 30, 2012, 9:20:22 PM
to pplace...@googlegroups.com

Hi Erick,

I've used the Greengenes alignment, with some additions and with a lane mask applied, for placements with pplacer, and it does quite well so far. I feel that these alignments would be the main ones people would want to use for reference packages, so making it clear what can and cannot be used with pplacer would be paramount.

Conor

Morgan Langille

Oct 30, 2012, 9:52:31 PM
to pplace...@googlegroups.com
Well, I don't think anything right now is great at high-resolution
taxonomic classification, so I think pplacer would do as well as (or
better than) something like RDP. No one in our group has done a thorough
investigation of how well pplacer is working using GG, but from a few
test cases it certainly seems better than something like RDP.

I guess you are worried that pplacer might not know the exact
placement due to so much data. Also it seems that the large number of
gaps and thus the location of informative positions would be a problem
with any tree building program and yet GG and SILVA are making these
large trees.

Considering the large demand for decent taxonomic classification using
16S and that most people seem happy settling for answers from the
quick and dirty RDP classification it seems like it might be worth
pursuing this a bit more? I have built a pipeline (similar to the one
outlined in Steve Kembel's recent paper) that takes 16S reads, aligns
them to the GG alignment using PyNAST, does the trimming, etc., so that
pplacer can be used. Conor (previous email) in our group has been
using this quite a bit for taxonomic classification so he can probably
comment more on successes on that front.

Also, I know that Jonathan Eisen's group is working on a taxonomic
classification using pplacer on protein coding genes (for WGS
metagenomics). However, maybe it would be worth discussing a brief
project for 16S data. If interested we could chat about this offline.

Morgan

Erick Matsen

Oct 31, 2012, 12:28:59 AM
to pplace...@googlegroups.com
> Well, I don't think anything right now is great at high-resolution
> taxonomic classification, so I think pplacer would do as well as (or
> better than) something like RDP. No one in our group has done a thorough
> investigation of how well pplacer is working using GG, but from a few
> test cases it certainly seems better than something like RDP.
>
> I guess you are worried that pplacer might not know the exact
> placement due to so much data. Also it seems that the large number of
> gaps and thus the location of informative positions would be a problem
> with any tree building program and yet GG and SILVA are making these
> large trees.

Yes, they are making these large trees. It's not always clear that their
overall goal is phylogenetic accuracy. Members of the Knight group have
made it clear to me that they aren't so bothered about accuracy, as long
as it doesn't change their UniFrac estimates. You can see this in, e.g.
http://pynast.wordpress.com/2010/04/06/pynast-1-1-released-better-alignments-and-a-15x-speed-increase/

That makes perfect sense to me-- we really just need to scale the method
to the application, like I said for the 16S copy number estimates.

> Considering the large demand for decent taxonomic classification using
> 16S and that most people seem happy settling for answers from the
> quick and dirty RDP classification it seems like it might be worth
> pursuing this a bit more? I have built a pipeline (similar to the one
> outlined in Steve Kembel's recent paper) that takes 16S reads, aligns
> them to the GG alignment using PyNAST, does the trimming, etc., so that
> pplacer can be used. Conor (previous email) in our group has been
> using this quite a bit for taxonomic classification so he can probably
> comment more on successes on that front.

Yes, we also have something like that (not surprisingly). However, the
direction we have headed is to break a big alignment into
sub-alignments, and first "bin" reads into the sub alignments using a
naive Bayes classifier. Then placement. We're currently validating this
approach, but are as usual chasing our tail with taxonomic things.
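
To make the "bin first, then place" idea concrete, here is a toy Python sketch of binning reads with a naive Bayes classifier over k-mers. It is an illustration only and not the implementation described above: the choice of k, the add-one smoothing, the uniform prior over bins, and all names (kmers, train, classify, binA, binB) are assumptions made for the example.

```python
import math
from collections import Counter


def kmers(seq, k=3):
    """Overlapping k-mers of a sequence after stripping gap characters."""
    seq = seq.upper().replace("-", "").replace(".", "")
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def train(bins, k=3):
    """Count k-mers per bin; bins maps a bin name to its reference sequences."""
    model = {}
    vocab = set()
    for name, seqs in bins.items():
        counts = Counter()
        for s in seqs:
            counts.update(kmers(s, k))
        model[name] = counts
        vocab.update(counts)
    return model, max(len(vocab), 1)


def classify(read, model, vocab_size, k=3):
    """Return the bin with the highest smoothed log-likelihood (uniform prior)."""
    best_name, best_score = None, -math.inf
    for name, counts in model.items():
        total = sum(counts.values())
        score = sum(
            math.log((counts[km] + 1) / (total + vocab_size))
            for km in kmers(read, k)
        )
        if score > best_score:
            best_name, best_score = name, score
    return best_name


# Made-up reference bins; real bins would be sub-alignments of the big alignment.
bins = {
    "binA": ["ACGTACGTACGT", "ACGTACGAACGT"],
    "binB": ["GGGCCCGGGCCC", "GGGCCCGGTCCC"],
}
model, vocab_size = train(bins)
print(classify("ACGTACG", model, vocab_size))
print(classify("GGCCCGG", model, vocab_size))
```

Each read would then be placed only on the smaller tree belonging to its bin, which keeps the per-read cost of likelihood placement down.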

> Also, I know that Jonathan Eisen's group is working on a taxonomic
> classification using pplacer on protein coding genes (for WGS
> metagenomics). However, maybe it would be worth discussing a brief
> project for 16S data. If interested we could chat about this offline.

Yes. I'm not saying that using pplacer on big trees and alignments is
the worst thing ever, but I wouldn't say that using it on the SILVA tree
can be recommended without reservation.

Once when I congratulated Morgan Price on FastTree 2.1, he commented
"yeah, we can build these big trees now, so now we need to figure out if
they mean anything."

I should make it clear in my note of caution, though, that I think that
the GG and SILVA folks are doing great work on a hard problem.

Morgan Langille

Oct 31, 2012, 6:28:45 AM
to pplace...@googlegroups.com
Agreed on all points. Thanks for the discussion!

Morgan

sinclair

Oct 31, 2012, 8:03:56 AM
to pplace...@googlegroups.com
It is interesting to read the discussion that has followed. My goal for the moment is not necessarily to have an evolutionarily correct phylogenetic tree, but rather to quantify the taxonomic assignments of several SSU sequences I have for a bacterial candidate division. I have sequences from a taq study, a metagenomic study, and several single-cell amplified genome sequences. I would also like to know how all these sequences relate to each other.

I see that Greengenes may be the preferred choice at the moment for this kind of experiment. I picked SILVA only because it seemed a more active and more up-to-date project.

But my question remains: how should I go about obtaining the information usually found in the RAxML info file? Do I necessarily have to rebuild the tree for this? That seems like a very computationally expensive operation.

Could one imagine a GG or SILVA pplacer reference package being made available in the future? Or might Morgan Langille, who says he already assigns 16S reads to the Greengenes tree, kindly share more details on the steps taken to do this?

Thanks.

Conor Meehan

Oct 31, 2012, 8:16:55 AM
to pplace...@googlegroups.com
We (Morgan and I) have a Greengenes reference package that I built a while ago. I recovered the stats file for the Greengenes tree using RAxML, which lets you compute likelihood scores etc. for a given tree.

I've attached the Greengenes reference package I created for use with pplacer. Maybe Erick can tell me if I did something wrong? It seems to work very well and gives fairly accurate placement results.


Cheers,

Conor


--
Conor Meehan, PhD
Postdoctoral Researcher

Biochemistry & Molecular Biology
Dalhousie University
5850 College Street
P.O. Box 15000
Halifax, NS, Canada 
B3H 4R2
Attachment: gg_only_genomes.zip

Morgan Langille

Oct 31, 2012, 8:42:06 AM
to pplace...@googlegroups.com
Thanks Conor,

Also, just to clarify, this contains only the sequences in Greengenes that
have completed reference genomes. You could build a reference package
with the entire Greengenes tree (or maybe one collapsed at 97%
identity) using FastTree, which is computationally feasible. This would
give you better sampling across the tree of life, but might increase
pplacer's placement uncertainty. It also of course depends on what
taxonomic level of classification you are hoping to achieve (e.g.
species, genus, family, etc.).

Good luck!

Morgan

sinclair

Oct 31, 2012, 11:54:18 AM
to pplace...@googlegroups.com
Conor, thanks for your answer and your attachment. You say you retrieved the stats file from an already built tree using RAxML for use with pplacer. How did you do this? I tried the -f n option like this:

$ raxmlHPC -f n -z example.newick -s example.phylip -m GTRGAMMA -n retrieve

Firstly, this assumes that the existing tree was built with the GTRGAMMA model, which I am not sure is the case for Greengenes or SILVA. Secondly, this did not produce the line that pplacer seems to be interested in, namely the one with the substitution rates, which is only found in the original RAxML_info.txt file:

Inference[0]: Time 0.182231 GAMMA-likelihood -112.619367, best rearrangement setting 2, alpha[0]: 33.447932 rates[0] ac ag at cg ct gt: 0.000017 1.607633 0.000017 0.000017 0.728746 1.0
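
To check whether a given info file contains that line, a small Python sketch like the following can pull out the alpha and rate values. It assumes the line has exactly the layout quoted above; other RAxML versions may format it differently, and this is not how pplacer itself reads the file. The function name is made up for illustration.

```python
import re

# The line quoted above, split only for readability.
INFO_LINE = (
    "Inference[0]: Time 0.182231 GAMMA-likelihood -112.619367, "
    "best rearrangement setting 2, alpha[0]: 33.447932 "
    "rates[0] ac ag at cg ct gt: 0.000017 1.607633 0.000017 0.000017 0.728746 1.0"
)

RATE_NAMES = ["ac", "ag", "at", "cg", "ct", "gt"]


def parse_inference_line(line):
    """Return (alpha, rates dict) from a line in the layout shown above."""
    alpha_match = re.search(r"alpha\[0\]:\s*([0-9.eE+-]+)", line)
    rates_match = re.search(r"rates\[0\]\s+ac ag at cg ct gt:\s*(.+)$", line)
    if not alpha_match or not rates_match:
        raise ValueError("line does not look like the expected Inference line")
    values = [float(x) for x in rates_match.group(1).split()[:6]]
    if len(values) != 6:
        raise ValueError("expected six substitution rates")
    return float(alpha_match.group(1)), dict(zip(RATE_NAMES, values))


alpha, rates = parse_inference_line(INFO_LINE)
print(alpha)
print(rates)
```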

Erick Matsen

Oct 31, 2012, 8:49:38 PM
to pplace...@googlegroups.com
Hello everyone--


Well, you are welcome to use the software in any way you like, but it
should be noted that the branch lengths in the GG tree are computed
with FastTree, which means that it will be a bit odd to be computing
rate parameters for that (fixed) tree using RAxML. The implementations
of CAT in the two programs are different, and the gamma rate
heterogeneity is something else entirely.


Erick

Conor Meehan

Nov 1, 2012, 10:55:45 AM
to pplace...@googlegroups.com
I think that even if the reference package is built with an incorrect
stats file, pplacer is still giving much better results than RDP does,
which is important for us. Perhaps testing the effect of different
stats files on placements on such trees would be a good thing to look
at? Morgan and I are willing to do some pplacer/RDP/etc. comparisons
for such things if you guys would be interested.

Lucas, I ran both the -f n and -f e commands to get the rates etc.
The commands are:
./raxmlHPC -f n -s gg_only_genomes.phy -m GTRGAMMA -z gg_only_genomes.tree -n like
and
./raxmlHPC -f e -s gg_only_genomes.phy -m GTRGAMMA -t gg_only_genomes.tree -n likeE

And then plug in the values as needed when creating the pplacer reference package. Hope that helps.


Cheers,
C