Hi Ali and Marcus,
Sorry for the slow response. I thought I'd consolidate your questions into one response to keep things on a single thread.
From Ali
Apropos of my earlier post and the very helpful comments by Jesse, I have some more questions to ask. I'd greatly appreciate if they can answered. Thanks a lot.
1) How does the sequence count for each OTU, in a given sample, impact and/or contribute to the calculations for metagenome prediction? Is the metagenome prediction independent of the value of the of sequence read provided for each OTU in each sample? In our case, we have PhyloChip data for our sample, which provides fluorescence intensities instead of sequence read. Jesse suggested that Picrust will treat each unit of fluorescence intensity as if it was a sequence read. If the actual sequence count is not important, then it is not a problem. If it is, then should we try to figure out a way to convert the fluorescence intensities to relative abundances?
2) For one of our analysis, we are only interested in predicting the metabolic capability of certain bacteria. As we have the NCBI accession number of the bacteria, we retrieved the 16S rRNA sequences of the bacteria from NCBI nucleotide database. A single FASTA file was then written which contained a collection of 16S rRNA sequences, with each sequence corresponding to a specific bacterium. Using Picrust, the FASTA file was then used to pick OTUs and create a biom-formatted OTU table, with OTUs assigned at 97% identity. The biom-formatted table was converted to tab-delimited format using convert_biom.py script.
To start out with I'd just like to give a quick overview of what PICRUSt is doing during metagenome prediction, since it is really very simple. I think this may help to clarify a lot of questions simultaneously.
PICRUSt has two stages: genome prediction and metagenome prediction. During the genome prediction stage, we use the greengenes tree + sequenced genomes to infer which genes families (KO groups) are present, and the copy number of each, in each greengenes 97% OTU. If OTUs already have sequenced genomes then we incorporate that information.
This information is all precalculated and provided to the community as ko_precalculated.biom.gz (in your PICRUSt download under /picrust/picrust/data/). We do the same thing for 16S rRNA copy numbers, and the file is saved as 16S_precalculated.biom.gz in the same folder. Although gzipped and in BIOM format, you can gunzip them and convert to text with convert_biom.py -b if you'd like to take a look at the information. Simply, it is a predicted table of all KO gene copy numbers across all bacterial and archaeal genomes.
So if you would like PICRUSt predictions for the KOs in particular organisms, they are in that file (but note that the OTU ids are relative to the greengenes tree).
During metagenome prediction we use the genome predictions to translate observed relative abundances of 16S rRNA gene sequences, and convert them to predicted relative abundances of KO categories. This is really just a matter of arithmetic given the precalculated genome prediction files.
If we know the (relative) number of 16S rRNA sequences abundance for an organism, then the metagenome contribution to a given KO for that organism is:
(16S_relative_abundance / inferred_16S_copy number ) * inferred_gene_copy_number
So we just sum this up across all organisms and all genes to get a predicted metagenome.
For example, lets say we observe 100 16S rRNA sequences from an organism with a predicted 16S rRNA copy number of 4 and a count in some KO of 2. We would:
1. Divide by the (predicted) number of 16S rRNA copies for that organism to get the organismal relative abundance. So if the organism had 4 16S rRNA copies, we would divide the observed relative abundance of 16S rRNA genes by 4 to find the relative abundance of the organism (25). This step is performed by normalize_by_copy_number.py If you'd like more information on 16S rRNA copy number normalization and why it might be important, you could read
this paper from Steve Kemble in PLoS One. After running normalize_by_copy_number.py we have an estimate for the relative abundance of the
organism rather than the relative abundance of the 16S rRNA
gene for that organism.
2. Multiply the organismal relative abundance by the predicted copy number of KOs. So in this case we would multiply 25 * 2 = 50. This is performed in predict_metagenomes.py
In practice, each of these steps is performed in a matrix across all KOs x organisms at once, but this should give the general idea. Sooooo.... having endured the lengthy explanation I'll give some very quick responses to your questions in light of what's going on behind the scenes:
Since we are only interested in predicting the metagenomes of the set of bacteria, can just a single column having a value of 1 be sufficient for the purpose of predicting the metagenome? Please see a sample input file (in tab-delimited) format attached. Also, do we need to normalize the OTU table using the normalize_by_copy_number.py script?
I don't think you want to do this. The input table would assume a single sample with only 7 OTUs, each at exactly equal abundance, and attempt to predict a metagenome. Based on some of Marcus' questions (below), it sounds like what you might really be interested in is the contribution of each of 50 OTUs to the overall metagenome. You could get this by making a table with your 50 OTUs and running metagenome_contributions.py
From Marcus
The fluorescence intensities (FI) have already been normalized. Low FI indicates low abundance of an OTU. One of the advantages of the PhyloChip over sequencing is that the chip can detect low abundance taxa with the same efficiency as taxa in high abundance.
Great! The goal for input to PICRUSt is an accurate representation of the 16S rRNA gene relative abundance. It sounds like you've already done the normalizations that you feel are necessary to ensure that your table accurately reflects 16S rRNA gene relative abundance across your samples (but see below for discussion of 0s and data transformation).
Let me briefly outline what we would like to use PICRUSt for: we have a number of OTUs that we found significantly more abundant in samples from treatment 1 compared to treatment 2. What I would like to determine is the metabolic capability of those OTUs that are significantly more abundant in treatment 1. What I have is a list of OTUs ranked by abundance in treatment 1. We thought we could use PICRUSt to predict the metabolic capability of let's say the top 50 or top 200 OTUs in that list. It is not clear to me if this approach would require providing FI information but I believe Ali found a way to do this assessment without FI.
One of the outputs of the protocol will be a file mapping your sequences (from the probes in this case) to the OTU ids on our tree. You would then use this mapping to pull out any OTUs of interest from the ko_precalculated.biom.gz
Judging from what Ali found out and from your comments it seems like that PICRUSt can do more than just predict metabolic capabilities. It sounds like it can determine if the inferred metabolic profiles from two different treatment groups are different. We would also be interested in this approach.
Yes, the core workflow for PICRUSt as described above, is to create a predicted or 'virtual' metagenome from a 16S rRNA survey. You could then compare gene frequencies across samples using standard techniques for comparing metagenomes. More detail about some of these downstream analyses can be found here:ko_precalculated.biom.gz
There is no negative control for each sample. As mentioned above, low FI indicates low abundance (or even absence) of an OTU. I think substituting sequence reads with FI is ok as long as this is only used to compare FI (sequence reads) of an OTU between samples. The FI data in our table has been transformed and, thus, differences in abundance are not very obvious (e.g. an OTU with FI of 10,000 in treatment group 1 and an FI of 9000 in treatment group 2 is actually twice as abundant in treatment 1 versus treatment 2). We could un-transform the data if that would benefit PICRUSt calculations. Also, if absolutely necessary for PICRUSt calculations we do have presence/absence information and could create a table that contains 0 when an OTU is absent in a sample and its FI when it is present in a sample.
Based on the calculation for PICRUSt's predicted metagenomes described above, I hope it will be a little more clear what is going on. I would definitely recommend that you un-transform the data so that your table reflects abundance (in a linear scale): so if OTU1 is listed as 10000 and OTU2 is listed as 20000 OTU1 is twice as abundant as OTU2. Also, I would recommend using the table that includes the FI if present and a 0 if absent, at least for OTUs that are present in one or more samples. If an OTU was absent across all samples it will not affect PICRUSt's predictions, so you don't need to include it.
And here is a question from Ali: 1) The Taxon ID for Lactobacillus rhamnosus GG is 644736382 as present on the IMG database. Check the following link:
I can look into this. I think the issue may be that IMG genomes which were incorporated into the greengenes tree use names that concatenate the 16S rRNA sequence accession with the IMG id. So in the gg_nwk file I have available the accession for Lactobacillus rhamnosus is 'NC_013198|644736382'. I'll follow up with Daniel to see if he can pass on a file that describes this mapping for all IMG ids.
3) Furthermore, once a metagenome has been predicted, I get an output
file, which I have converted to tab-delimited format. A sample of the output
file is attached with this email. If I have understood correctly, the first
column contains the KO ID, the second column contains the number of OTUs which
possess the enzyme corresponding to the KO ID and the third column contains
description of the pathway in which the KO is involved. Am I correct? I would
like to know if there is a way in Picrust to figure out which of the OTUs have
which KOs present in them.
The way to do this is with the newly added metagenome_contributions.py script. However, be a touch careful, since the full table of every OTU's contribution to every sample across every KO gets extremely large quickly. So you may want to use the commandline parameters to limit to KOs of interest.
Cheers,
Jesse