good 16S reference tree for a set of analyses performed on soil bacterial communities

Jens Preem

Mar 29, 2017, 7:17:47 AM
to pplacer users
Hi,
I have written a few articles based on analyses of soil samples where 16S NGS data was collected alongside different soil chemical parameters.
As I work on my thesis, I am revisiting these same datasets and thinking of using pplacer to perform edgePCA and perhaps obtain a Kantorovich-Rubinstein distance matrix of the samples for further analysis.
The step I am stuck on right now is the construction of the reference tree for placement.
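
For orientation, the Kantorovich-Rubinstein (KR) distance between two samples can be read as the work needed to move one sample's placement mass onto the other's along the reference tree. Below is a minimal Python sketch of the p = 1 case under simplifying assumptions: all mass sits on nodes rather than along edges, pendant branch lengths are ignored, and the tree, names, and function are hypothetical. It is an illustration only, not guppy's implementation.

```python
def kr_distance(parent, length, mass_p, mass_q):
    """Toy KR (earth mover's) distance, p = 1, with mass located on nodes.

    parent: dict child -> parent (the root has no entry)
    length: dict child -> length of the edge above that child
    mass_p, mass_q: dict node -> fraction of each sample's mass (each sums to 1)
    """
    nodes = set(parent) | set(parent.values())
    # Net mass difference sitting at each node.
    diff = {n: mass_p.get(n, 0.0) - mass_q.get(n, 0.0) for n in nodes}

    def depth(node):
        d = 0
        while node in parent:
            node = parent[node]
            d += 1
        return d

    total = 0.0
    # Visit the deepest nodes first so each subtree is summed before its parent.
    for node in sorted(parent, key=depth, reverse=True):
        # Mass imbalance below this edge must cross it, costing length * |imbalance|.
        total += length[node] * abs(diff[node])
        diff[parent[node]] += diff[node]
    return total


# Hypothetical tree: root R -> (X, C); X -> (A, B)
parent = {"A": "X", "B": "X", "X": "R", "C": "R"}
length = {"A": 0.5, "B": 0.5, "X": 1.0, "C": 2.0}
sample_1 = {"A": 1.0}  # all mass placed at A
sample_2 = {"C": 1.0}  # all mass placed at C
print(kr_distance(parent, length, sample_1, sample_2))  # 3.5, the A-to-C path length
```

With all mass concentrated at single nodes, the distance reduces to the path length between them, which is a quick way to sanity-check the idea.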

My analyses thus far have the sequences clustered using CROP (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3042185/); the sequence and OTU classifications are based on the Greengenes taxonomy.
Greengenes provides pre-computed trees annotated with its own taxonomy at different similarity levels, but I am afraid they are probably far too big (for example, the 97_otu taxonomy and tre files contain ~100K entries).
How big can the tree be before pplacer hits computational limits? What would be a way to make a suitable tree in my case: perhaps a smaller subset of GG or another database, comprised only of sequences that have at least a genus-level classification?
Any suggestions for constructing a "general" bacterial 16S reference tree of workable size, or examples of such trees from more recent work, are very welcome.
JP
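
The subsetting idea in the question (keep only sequences classified to at least genus level) could be sketched as follows. This is a minimal Python example that assumes a Greengenes-style taxonomy file, with one tab-separated line per sequence ID and semicolon-separated ranks carrying prefixes such as `g__`; the IDs and the function name are made up for illustration. Pruning the tree down to the selected IDs would need a tree library and is not shown.

```python
def ids_with_genus(taxonomy_lines):
    """Return IDs whose taxonomy string has a non-empty genus (g__) rank."""
    keep = []
    for line in taxonomy_lines:
        line = line.strip()
        if not line:
            continue
        seq_id, _, lineage = line.partition("\t")
        ranks = [r.strip() for r in lineage.split(";")]
        # An unclassified genus appears as a bare "g__" prefix with no name.
        if any(r.startswith("g__") and len(r) > 3 for r in ranks):
            keep.append(seq_id)
    return keep


# Hypothetical IDs; lineage layout assumed from Greengenes-style taxonomy files.
example = [
    "1001\tk__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; "
    "o__Pseudomonadales; f__Pseudomonadaceae; g__Pseudomonas; s__",
    "1002\tk__Bacteria; p__Acidobacteria; c__Solibacteres; "
    "o__Solibacterales; f__; g__; s__",
]
print(ids_with_genus(example))  # ['1001']
```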

siavash mirarab

Mar 29, 2017, 11:20:54 AM
to pplace...@googlegroups.com
I hope pplacer developers don't mind my shameless plug here, but, here it goes ...

We have a tool called SEPP that can run on the Greengenes dataset with 200,000 sequences as the backbone. SEPP is built on top of pplacer (and HMMER for alignment) and relies on extensive divide-and-conquer. A standalone version that works on Greengenes is available here:


The paper that introduced the approach is:

Mirarab S, Nguyen N, Warnow T. SEPP: SATé-Enabled Phylogenetic Placement. Pacific Symp Biocomput. 2012:247-258.

I was able to do placement on this tree with many tens of thousands of reads in under one day given 16 cores. SEPP has both multithreading and checkpointing. I hope this is useful. 

Thanks
Siavash

--
You received this message because you are subscribed to the Google Groups "pplacer users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pplacer-users+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Siavash Mirarab


Erick Matsen

Mar 29, 2017, 4:12:55 PM
to pplace...@googlegroups.com
Siavash--

Not at all! I think this is the right approach. Naked pplacer can't survive on Greengenes, and it's not a good fit anyhow. HMMs should be more powerful with a significant amount of divergence. 

So, +1.

Erick

--
Frederick "Erick" Matsen, Associate Member
Fred Hutchinson Cancer Research Center
http://matsen.fredhutch.org/

Jeff Bowman

Mar 29, 2017, 4:46:22 PM
to pplacer users
Another shameless plug... you might try our paprica pipeline, which uses pplacer/infernal to place reads on a reference tree constructed from all completed genomes in GenBank. In addition to community structure, paprica will give you an estimate of the gene content of your community and other useful info, and it normalizes for 16S rRNA gene copy number. Check the manual for instructions on parallelizing your runs.

Github repository:
https://github.com/bowmanjeffs/paprica

A tutorial for basic analysis can be found here:
http://www.polarmicrobes.org/analysis-with-paprica/

Cheers,
Jeff

Erick Matsen

Mar 29, 2017, 5:13:40 PM
to pplacer users
If y'all want to figure out ways to make your plugs directly on the pplacer repo, like on the README, we can do that. There's also a wiki that's not used.


Jens Preem

Mar 30, 2017, 3:12:18 AM
to pplacer users

Thank you all for the quick answers. Both of the plugs look promising. I have to take some time to check them out.

Jens Preem

Mar 30, 2017, 3:40:01 AM
to pplacer users
Do I understand correctly that the json output from SEPP can immediately be used in pplacer/guppy as a jplace file? Or does it need some formatting?



siavash mirarab

Mar 30, 2017, 12:23:21 PM
to pplace...@googlegroups.com
The json file produced by SEPP is just like the pplacer output (same format) and can be fed to the guppy for downstream analyses.
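
A quick way to sanity-check such a file before handing it to guppy is to confirm that it parses as JSON and carries the usual jplace top-level keys. The sketch below is a minimal Python example; the set of required keys reflects the jplace format as commonly described (`tree`, `placements`, `fields`, `version`), and the toy document, read name, and function name are invented for illustration rather than taken from SEPP output.

```python
import json

# Top-level keys expected in a jplace document (assumed from the jplace format).
REQUIRED_KEYS = {"tree", "placements", "fields", "version"}


def summarize_jplace(text):
    """Parse jplace-style JSON text and return basic counts."""
    doc = json.loads(text)
    missing = REQUIRED_KEYS - doc.keys()
    if missing:
        raise ValueError(f"missing jplace keys: {sorted(missing)}")
    n_names = 0
    for placement in doc["placements"]:
        # Names are listed under "n" (names) or "nm" (name, multiplicity pairs).
        n_names += len(placement.get("n", [])) + len(placement.get("nm", []))
    return {
        "fields": doc["fields"],
        "pqueries": len(doc["placements"]),
        "names": n_names,
    }


# Toy document with made-up values, only to exercise the function.
toy = json.dumps({
    "version": 3,
    "tree": "((A:0.1{0},B:0.2{1}):0.3{2},C:0.4{3}){4};",
    "fields": ["edge_num", "likelihood", "like_weight_ratio",
               "distal_length", "pendant_length"],
    "placements": [{"p": [[0, -1234.5, 1.0, 0.05, 0.01]], "n": ["read_1"]}],
    "metadata": {},
})
print(summarize_jplace(toy))
```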



Jeff Bowman

Apr 1, 2017, 11:18:37 AM
to pplacer users
Erick,
That would be great.  What would you need from us to include a plug in the README?

Cheers,
Jeff

Erick Matsen

Apr 1, 2017, 12:06:45 PM
to pplace...@googlegroups.com
Just make a pull request changing the README. I suggest a blurb that's a slightly compacted version of what you have here.

Siavash, would you contribute as well?

siavash mirarab

Apr 7, 2017, 6:32:15 PM
to pplace...@googlegroups.com
Just made a pull request (I hope on the right readme file). Thanks Erick!

Regards
Siavash



Erick Matsen

Apr 8, 2017, 8:59:46 PM
to pplace...@googlegroups.com
OK, the list of tools is now up at https://github.com/matsen/pplacer/ .



Jeff Morris

Sep 29, 2017, 6:02:39 PM
to pplacer users
Siavash - I've been using SEPP to place some V4 16S sequences on the Greengenes tree using your SEPP_GG package.  I managed to get .json files and make nice guppy fat trees, but when I try to compare two placement files using kr or kr_heat in guppy, I get errors.  The first error was that the reference trees were different for the two samples.  It turns out this was because SEPP appended some kind of code to the sequence names in the trees when it made the .json files -- I fixed this using sed at the command line.  Then, I got an error that has stumped me:

Uncaught exception: Newick_bark.No_bl

After poking around in the pplacer source code for a bit and reading the pplacer manual, I think this is because the Greengenes tree contains a lot of very similar or even identical sequences such that there are some branches with 0 length, and this is blowing up guppy.  Have you seen anything similar, and if so, do you have a work-around?

Thanks in advance for your help,
Jeff Morris



siavash mirarab

Oct 1, 2017, 9:14:45 PM
to pplace...@googlegroups.com
Dear Jeff, 

Thanks for the useful info! No, I hadn't tried the kr feature of guppy on SEPP outputs. I'll take a look to see if I can think of a solution. I may make an alternative alignment/tree with no zero-length branches (if possible). 

Thanks
Siavash



Erick Matsen

Oct 2, 2017, 2:19:16 PM
to pplace...@googlegroups.com
Hello Jeff--

That error probably means that the branch lengths got corrupted by your sed edit. Zero branch lengths should be fine for parsing (if not for downstream analysis), but missing branch lengths are a problem. Are you sure that's not the case?

Erick
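
One way to tell the two cases apart (missing versus zero branch lengths) is to scan the tree string of the edited .json file. The following is a minimal Python sketch, not a Newick parser: it assumes a jplace-style tree in which each edge is written as `name:length{edge_number}`, that names contain no `:` or `{` characters, and that the final edge before `;` is the root. The function name and example tree are hypothetical.

```python
import re

# Optional ":length", then "{edge_number}", then an optional ";" marking the root.
EDGE = re.compile(r"(:[0-9.eE+\-]+)?\{(\d+)\}(;?)")


def check_branch_lengths(tree):
    """Return (missing, zero): edge numbers with no length and with length 0."""
    missing, zero = [], []
    for m in EDGE.finditer(tree):
        length, edge, is_root = m.group(1), int(m.group(2)), m.group(3)
        if is_root:
            continue  # the root edge normally carries no length
        if length is None:
            missing.append(edge)
            continue
        try:
            value = float(length[1:])  # drop the leading ":"
        except ValueError:
            missing.append(edge)  # treat an unparsable length as missing
            continue
        if value == 0.0:
            zero.append(edge)
    return missing, zero


# Hypothetical tree: edge 1 lost its length, edge 3 has a zero length.
tree = "((A:0.1{0},B{1}):0.3{2},C:0{3}){4};"
print(check_branch_lengths(tree))  # ([1], [3])
```

If the first list is non-empty after the sed step but empty for the original file, the edit is the likely culprit.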

siavash mirarab

Oct 2, 2017, 5:26:12 PM
to pplace...@googlegroups.com
Jeff, if you could share with me your json (outside the mailing list) I can see what these extra annotations are and I can make sure they are not there or that they are made optional (or at least double check the set). 

Thanks
Siavash



Jeff Morris

Oct 2, 2017, 5:56:29 PM
to pplacer users
Thanks! Sending now.