Using sequencing data from multiple amplicons in PICRUSt2

James Harder

Jun 26, 2018, 2:30:16 PM

to picrust-users

I have a somewhat complicated question concerning the types of inputs that can be used with PICRUSt2. Our 16S sequencing and bacteria profiling was done by the Genome Technology Access Center at Washington University using a method that differs from the typical 16S sequencing.

Instead of using one amplicon that covers one or more hypervariable regions, their method uses 14 different primer pairs to generate 14 amplicons covering 9 different hypervariable regions of the 16S rRNA gene. They then (in their own words) align the reads from multiple amplicons and use the alignment counts to call specific species as being present. They use a database based on the SILVA database for this step. If a read can't be assigned a species, they use a separate database to assign it a genus, if possible. According to them, the benefit of this method is significantly increased accuracy and sensitivity in species-level calls. I have attached a brief slide show they sent us describing the process, in case you want more details or a better explanation.

While I don't doubt that their process is better at species-level analysis than relying on just one or two variable regions, it does present some problems when it comes to performing metagenomic analysis on the samples. We have the BIOM table that they generated using their method, but we lack representative sequences for the identified OTUs because their process does not generate representative sequences. According to the person at Wash U I have been corresponding with, all reads from the 14 primer pairs are treated as separate. They do not even join the forward and reverse reads for each primer pair, because some of the pairs have little or no overlap between forward and reverse reads. In his words, "multiple amplicons are aligned and then the alignment counts are used to call specific species as being present. This means that a GV9 “OTU” is a species call". He had two suggestions: "To make a GV9 representative sequence one could just select one amplicon (ie V4) and randomly select one of the V4 reads that aligned to the species as the representative. Another way would be to select one full length 16S reference sequence for each species and return that as a representative sequence." Do you think that either of these methods would work for PICRUSt2? One issue I see with his second suggestion is that while the BIOM table does give specific species, when I look in the SILVA database, each species has multiple 16S sequences, and I have no idea which, if any, would be an acceptable representative 16S sequence.

I have looked into using the raw sequencing data, but there I have run into problems as well. We do have the raw FASTQ files that contain all the amplicon reads for each sample, but since the different amplicons cover different parts of the 16S rRNA gene, that means that one bacterial genome could give rise to two or more reads if it was recognized by more than one primer pair. (For example, if the primer pair amplifying V6 and the primer pair amplifying V3 both bind the gene, then two reads would be generated from that genome.) This obviously means that different bacterial species will have different numbers of reads per bacterial genome depending on how many of the primer pairs they bound, which would make an analysis that relied on the relative quantification of species futile.
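To make the bias concrete, here is a minimal sketch with made-up numbers. It assumes, purely for illustration, that every primer pair that binds yields exactly one read per genome; the species names and counts are hypothetical and not taken from our data.

```python
def apparent_relative_abundance(genome_counts, amplicons_per_species):
    """Relative abundance as it would appear from pooled amplicon reads.

    genome_counts: true number of genomes per species (hypothetical values).
    amplicons_per_species: how many of the primer pairs bind each species.
    Assumes one read per genome per binding primer pair (a simplification).
    """
    reads = {
        species: genome_counts[species] * amplicons_per_species[species]
        for species in genome_counts
    }
    total = sum(reads.values())
    return {species: n / total for species, n in reads.items()}


# Two species with identical true abundance (hypothetical numbers).
genomes = {"species_A": 100, "species_B": 100}
# species_A is amplified by all 14 primer pairs, species_B by only 7.
binding = {"species_A": 14, "species_B": 7}

shares = apparent_relative_abundance(genomes, binding)
for species, share in shares.items():
    print(f"{species}: {share:.2f}")
```

In this toy case the two species are equally abundant, yet species_A accounts for two thirds of the pooled reads simply because more primer pairs bind it.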

The last resort would be to pick one primer pair and use the raw FASTQ files from it to do the PICRUSt2 analysis with that restricted set of data. My PI is fairly keen on not resorting to this, though. She doesn't want to lose the data from the other amplicons.

I would greatly appreciate any advice you could give me on how to fit this square peg in a round hole.

Wash U GV9 Sequencing Slideshow.pdf

Wash U GV9 Biom table.xls

Gavin

Jun 28, 2018, 3:42:48 PM

to picrus...@googlegroups.com

Hi there,

I'll be away until July 10th, but I can give you my quick thoughts! It would be possible to use all of the raw reads from the different amplicons, but it would be extremely slow, likely wouldn't add much information, and would be difficult to analyze. I would personally run each of the amplicons separately through DADA2 or deblur to get amplicon sequence variants, then run each of these datasets through PICRUSt2 individually and compare how consistent the predictions are across the different variable regions. Other than that I'm not sure what to recommend, sorry!
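As a rough illustration of the consistency check, the sketch below compares per-region predicted function profiles for one sample using Spearman correlation. It is only one possible way to do this: the dictionary layout, region names, function IDs and abundances are made up, and loading the real per-region prediction tables is left out.

```python
from itertools import combinations


def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns NaN if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    if sx == 0 or sy == 0:
        return float("nan")
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def pairwise_consistency(profiles):
    """profiles maps region name -> {function ID: predicted abundance}.

    Functions missing from a region are treated as abundance 0.
    Returns {(region_a, region_b): Spearman correlation}.
    """
    functions = sorted(set().union(*(p.keys() for p in profiles.values())))
    vectors = {
        region: [p.get(f, 0.0) for f in functions]
        for region, p in profiles.items()
    }
    return {
        (a, b): spearman(vectors[a], vectors[b])
        for a, b in combinations(sorted(vectors), 2)
    }


# Hypothetical predicted abundances for one sample, per variable region.
example = {
    "V3": {"K00001": 10.0, "K00002": 5.0, "K00003": 1.0},
    "V4": {"K00001": 8.0, "K00002": 6.0, "K00003": 2.0},
    "V6": {"K00001": 1.0, "K00002": 5.0, "K00003": 9.0},
}
for pair, rho in pairwise_consistency(example).items():
    print(pair, round(rho, 3))
```

Region pairs with correlations near 1 rank the predicted functions similarly; low or negative values would flag regions whose predictions disagree.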

Best,

Gavin


James Harder

Aug 14, 2018, 11:56:42 AM

to picrust-users

Hi Gavin,

To make a long story short, I can't generate amplicon sequence variants from our raw data because, for most of the amplicons, the forward and reverse reads do not overlap enough to do paired-end analysis in DADA2. Apparently, since Wash U's method doesn't involve paired-end analysis, the primers they pick are not suited to it.

So I am back where I started, with a species-level BIOM table and no representative sequences. I am currently working on generating a representative sequence table by looking up each OTU in the SILVA database and selecting the highest-quality complete 16S sequence to use as the representative sequence for that OTU. My question to you is: how do you think this approach will affect the results generated by PICRUSt2 compared to normal OTU picking? Since some species have multiple sequences in the SILVA database, will picking one versus another (assuming both are complete and high quality) significantly affect the output of PICRUSt2?
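For reference, the selection step I have in mind looks roughly like the sketch below. The record layout (`otu_id`, `quality`, `sequence`) and the length cutoff are placeholders I made up for illustration, not SILVA's actual export format; the only firm requirement I am relying on is that the FASTA headers match the OTU IDs in the BIOM table.

```python
def pick_representatives(candidates, min_length=1200):
    """Pick one 16S sequence per OTU from candidate reference records.

    candidates: iterable of dicts with hypothetical keys
        "otu_id"   - ID exactly as it appears in the BIOM table
        "quality"  - any numeric quality score (higher is better)
        "sequence" - the 16S sequence (RNA or DNA alphabet)
    min_length: arbitrary cutoff standing in for "complete" sequences.
    Ties on (quality, length) keep the first record encountered.
    """
    best = {}
    for rec in candidates:
        # Convert RNA alphabet to DNA so downstream tools see A/C/G/T.
        seq = rec["sequence"].upper().replace("U", "T")
        if len(seq) < min_length:
            continue
        key = (rec["quality"], len(seq))
        current = best.get(rec["otu_id"])
        if current is None or key > current[0]:
            best[rec["otu_id"]] = (key, seq)
    return {otu: seq for otu, (_, seq) in best.items()}


def to_fasta(representatives):
    """Format the chosen sequences as FASTA text, one record per OTU."""
    return "".join(
        f">{otu}\n{seq}\n" for otu, seq in sorted(representatives.items())
    )


# Toy records (hypothetical); min_length lowered so the short demo sequences pass.
records = [
    {"otu_id": "OTU_1", "quality": 90, "sequence": "ACGUACGU"},
    {"otu_id": "OTU_1", "quality": 95, "sequence": "ACGUACGA"},
    {"otu_id": "OTU_2", "quality": 80, "sequence": "ACG"},
]
print(to_fasta(pick_representatives(records, min_length=5)), end="")
```

Any OTU left without a sequence that passes the cutoff (OTU_2 in the toy data) would be missing from the FASTA file and would need to be handled separately.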

James Harder

Aug 14, 2018, 11:59:33 AM

to picrust-users

Sorry, hit post too soon.

I know that this is not the typical use case, but I would appreciate any insight you could give me.

Thank you,

Jim Harder

Gavin Douglas

Aug 14, 2018, 12:32:20 PM

to picrus...@googlegroups.com

Hi again,

You could certainly take that approach, and it would likely give you a *similar* profile to what you would get if you plugged in ASVs. The output would certainly be lower quality though, if only due to errors in taxonomy assignment and the variation in 16S sequences within genera and species. If the sequences within the same species are very similar, then the predicted profile will also be similar. If they are distinct, then it could definitely throw off the predictions. However, this is an issue with making predictions based on single 16S sequences anyway. I only kept 1 sequence per genome for the database, and in many cases this selection was random, for instance when there were 2 sequences of equal quality.

An alternative solution could be to use PanFP (paper: https://bmcresnotes.biomedcentral.com/articles/10.1186/s13104-015-1462-8), which is similar to PICRUSt, but predicts functions based on taxonomy levels rather than sequences. I think this would likely be the easiest approach to take and would be the simplest to explain in the future when you’re describing your results.