I have a somewhat complicated question concerning the types of inputs that can be used with PICRUSt2. Our 16S sequencing and bacteria profiling was done by the Genome Technology Access Center at Washington University using a method that differs from the typical 16S sequencing.
Instead of using one amplicon that covers one or more hypervariable regions, their method uses 14 different primer pairs to generate 14 amplicons covering 9 different hypervariable regions in the 16S rRNA gene. They then, (in their own words), align the reads from multiple amplicons and then use the alignment counts to call specific species as being present. They use a database based on the SILVA database for this step. If a read can't be assigned a species, they use a separate database to assign it a genus, if possible. According to them, the benefit of this method is significantly increased accuracy and sensitivity in species-level calls. I have attached a brief slide show they sent us describing the process, if you want/need more details or a better explanation.
While I don't doubt that their process is better at species level analysis than just relying on one or two variable regions, it does present some problems when it comes to performing metagenomic analysis on the samples. We have the biom table that they generated using their method, but we lack representative sequences for the identified OTUs because their process does not generate representative sequences. According to the person at Wash U I have been corresponding with, all reads from the 14 primer pairs are treated as separate. They do not even do paired joining for the forward and reverse primer pairs because some of them have little or no overlap between forward and reverse reads. In his words, "multiple amplicons are aligned and then the alignment counts are used to call specific species as being present. This means that a GV9 “OTU” is a species call". He had two suggestions: "To make a GV9 representative sequence one could just select one amplicon (ie V4) and randomly select one of the V4 reads that aligned to the species as the representative. Another way would be to select one full length 16S reference sequence for each species and return that as a representative sequence." Do you think that either of these methods would work for PICRUSt2? One issue I see with his second suggestion is that while the biom table does give specific species, when I look on the SILVA database, each species has multiple 16S sequences, and I have no idea which if any, would be an acceptable representative 16S sequence.
I have looked into using the raw sequencing data, but there I have run into problems as well. We do have the raw FASTQ files that contain all the amplicon reads for each sample, but since the different amplicons cover different parts of the 16S rRNA gene, that means that one bacterial genome could give rise to two or more reads if it was recognized by more than one primer pair. (For example, if the primer pair amplifying V6 and the primer pair amplifying V3 both bind the gene, then two reads would be generated from that genome.) This obviously means that different bacterial species will have different numbers of reads per bacterial genome depending on how many of the primer pairs they bound, which would make an analysis that relied on the relative quantification of species futile.
The last resort would be to pick one primer pair and use the use the raw FASTQ files from it to do the PICRUSt2 analysis with that restricted set of data. My PI is fairly keen on not resorting to this, though. She doesn't want to lose the data from the other amplicons.
I would greatly appreciate any advice you could give me on how to fit this square peg in a round hole.