There are two major components to the analysis pipeline detailed in the tutorial, the first is to prepare the data for analysis (dealing with barcoded samples, quality filtering, etc.) and the second is the actual analysis. The first component ends with the output of the split_libraries.py which is a FASTA file (.fna) and a user-generated Mapping file (.txt). The issue with using the QIIME scripts to prepare PGM data in the first component is that the QIIME scripts expect the data in 454 format. One approach to solve this problem would be to write a script to convert the PGM quality scores to 454 quality scores, however, as Tony pointed out, there may still be issues with error modeling and as I found out, my use of the same barcode more than once in the project would also have prevented me from using split_libraries.py successfully. The alternative is to process the PGM data outside of QIIME to create the starting files and enter the QIIME pipeline at the second component where the actual analysis scripts are run. The starting files we need to use QIIME entering the workflow at pick_otus_through_otu_table.py, are a FASTA file (.fna) and a Mapping file (.txt) describing the samples. To create these files, I performed the following:
1) We ran five 316 PGM chips, each chip was loaded with 4 barcoded libraries generated from a human bacterial sample. Thus, we have five FASTQ files representing 20 samples. Ten samples were in treatment group "high" and ten were in treatment group "low". The question is whether there are significant differences in the microbial communities between the two treatment groups - analogous to the fasting example in the tutorial.
2) The FASTQ files were converted to FASTA files (I used a NextGENe). The quality information in the FASTQ file was used during the conversion to filter out poor quality reads (median quality score = 30), and trim reads to remove low quality ends (trim after 1 base quality =<20), and short reads (minimum read length = 79). The minimum length of 79 was chosen so that all reads contained all of the V6 hypervariable region targeted by our primers. These quality filters are very stringent, I chose this level of stringency so I would only be working with very high quality sequence with the added benefit that the number of sequences analyzed was reduced thus reducing script run times.
3) The five resulting FASTA files were split based on barcode to produce 20 FASTA files, one for each sample, and the sequences in the FASTA files were stripped of the barcode sequence as well as the primer sequence on both ends. The resulting 20 FASTA files now have sequence that is ready for analysis, but the headers are not what the QIIME scripts need, and they need to be combined into a single file.
4) I combined the 20 FASTA files into a single file and re-formatted the headers to fit the format output of QIIME's spplit_libraries.py using a PERL script (if you want the script, email me).
5) I made the Mapping file by hand in a text editor following the example provided in the tutorial data.
After the 5 steps above, I had two files: PGM.fna and PGM.txt. The first few sequences of PGM.fna look like:
>112.BC1_1 RXB0T:48:298 orig_bc=ACTCA new_bc=ACTCA bc_diffs=0
CACCACCTGTCTACCGGTTCCCGAAGGCACAGTCATACTTCTATGACCTCCCGGAGATGTCAAGGTCTGGTA
>112.BC1_2 RXB0T:49:264 orig_bc=ACTCA new_bc=ACTCA bc_diffs=0
CAGCACCTGTGTTACGGTTCCCGAAGGCACTCCTCCGTCTCTGGAGGATTCCGTACATGTCAAGACC
>112.BC1_3 RXB0T:32:335 orig_bc=ACTCA new_bc=ACTCA bc_diffs=0
ACCACCTGCACACGACCAACTAAATGCCACCACATCTCTGCAGTGTCGCCGTGCATGTCAAGCCT
>112.BC1_4 RXB0T:40:344 orig_bc=ACTCA new_bc=ACTCA bc_diffs=0
CACCACCTGTTTTCTGGCTTCCGAAGAAGAGGAACTATCTCTAGTTCTGTCCATCAATGTCAAGACC
>112.BC1_5 RXB0T:44:307 orig_bc=ACTCA new_bc=ACTCA bc_diffs=0
CAGCACCTGTCTCATGGTTCCCGAAGGCACTCCTCCGTCTCTGGAGGATTCCGTACATGTCA
Where 112.BC1 is the sampleID.barcode followed by _sequencenumber (starting at 1 and counting up). The remaining parts of the header are not required, but were easy to code into the script and result in headers that exactly mimic the output of split_librares.py so I put them in just in case I was wrong about them not being required.
The Mapping file, in its entirety, is:
#SampleID BarcodeSequence LinkerPrimerSequence Treatment DOB Description
#mapping file for PGM.fna for the QIIME analysis package.
112.BC1 ACTCA YATGCTGCCTCCCGTAGGAGT low 20120910 112.BC1
113.BC1 ACTCA YATGCTGCCTCCCGTAGGAGT low 20120910 113.BC1
119.BC6 AGCCA YATGCTGCCTCCCGTAGGAGT low 20120910 119.BC6
127.BC2 CGTGT YATGCTGCCTCCCGTAGGAGT low 20120910 127.BC2
129.BC7 GACAT YATGCTGCCTCCCGTAGGAGT low 20120910 129.BC7
130.BC3 GATGA YATGCTGCCTCCCGTAGGAGT low 20120910 130.BC3
131.BC4 CGATA YATGCTGCCTCCCGTAGGAGT low 20120910 131.BC4
132.BC1 ACTCA YATGCTGCCTCCCGTAGGAGT low 20120910 132.BC1
138.BC5 CTTAC YATGCTGCCTCCCGTAGGAGT high 20120910 138.BC5
139.BC8 GCCTT YATGCTGCCTCCCGTAGGAGT low 20120910 139.BC8
143.BC1 ACTCA YATGCTGCCTCCCGTAGGAGT low 20120910 143.BC1
150.BC2 CGTGT YATGCTGCCTCCCGTAGGAGT high 20120910 150.BC2
151.BC1 ACTCA YATGCTGCCTCCCGTAGGAGT high 20120910 151.BC1
156.BC2 CGTGT YATGCTGCCTCCCGTAGGAGT high 20120910 156.BC2
162.BC3 GATGA YATGCTGCCTCCCGTAGGAGT high 20120910 162.BC3
165.BC6 AGCCA YATGCTGCCTCCCGTAGGAGT high 20120910 165.BC6
166.BC4 CGATA YATGCTGCCTCCCGTAGGAGT high 20120910 166.BC4
177.BC3 GATGA YATGCTGCCTCCCGTAGGAGT high 20120910 177.BC3
185.BC4 CGATA YATGCTGCCTCCCGTAGGAGT high 20120910 185.BC4
208.BC7 GACAT YATGCTGCCTCCCGTAGGAGT high 20120910 208.BC7
Note that we re-used barcodes between chips. In retrospect, this would have been a serious problem if we had tried to use the split_libraries.py script which expects a unique barcode for each sample. The LinkerPrimerSequece is not in our sequence, but was in the tutorial Mapping file and again, in the interest of mimicking the functioning test files as much as possible, I left it in my Mapping file. I do not think it is used by any of the scripts downstream of pick_otus_through_otu_table.py and is therefore not necessary. I think the same is true of the BarcodeSequence, DOB, and Description, but rather than finding out I was wrong the hard way, I kept them in the file.
So, I have the files I need to enter the QIIME pipeline at pick_otus_through_otu_table.py. Below I am showing the actual command line as I proceed through the scripts in the tutorial. My comments are indicated with #.
gwatts@futscherlab32gb:~/QIIME_files$ cd PGM
gwatts@futscherlab32gb:~/QIIME_files/PGM$ ls
PGM.fna PGM.txt
# The starting files.
gwatts@futscherlab32gb:~/QIIME_files/PGM$ pick_otus_through_otu_table.py -i PGM.fna -o otus/
# Picks the OTUs to be used in the downstream analyses and saves results to directory “otus”
gwatts@futscherlab32gb:~/QIIME_files/PGM$ per_library_stats.py -i otus/otu_table.biom
Num samples: 20
Num otus: 3968
Num observations (sequences): 3920441.0
Seqs/sample summary:
Min: 37935.0
Max: 508787.0
Median: 186429.0
Mean: 196022.05
Std. dev.: 130457.80878
Median Absolute Deviation: 100658.0
Default even sampling depth in
core_qiime_analyses.py (just a suggestion): 37935.0
Seqs/sample detail:
132.BC1: 37935.0
177.BC3: 40072.0
150.BC2: 43613.0
185.BC4: 50611.0
143.BC1: 85096.0
130.BC3: 86446.0
112.BC1: 93455.0
138.BC5: 103118.0
131.BC4: 144820.0
139.BC8: 154006.0
166.BC4: 218852.0
208.BC7: 236403.0
129.BC7: 237947.0
162.BC3: 250536.0
113.BC1: 285412.0
165.BC6: 304125.0
127.BC2: 305059.0
119.BC6: 321998.0
156.BC2: 412150.0
151.BC1: 508787.0
# Checking the results of pick_otus_through_otu_table.py, we see there are 3968 OTUs. There are 4 samples (132, 177, 150, 185) with significantly fewer reads than all the rest - these four were from the same chip and reflect a mediocre sequencing run due to low templated bead yield from emulsion PCR. The sample with the fewest sequences is 132, with 37,935 - this will come into play when we look at the beta diversity later on which requires us to specify the number of sequences to use....
gwatts@futscherlab32gb:~/QIIME_files/PGM$ make_otu_heatmap_html.py -i otus/otu_table.biom -o otus/OTU_Heatmap/
# Makes a heatmap-generating html file and supporting javascript directory that, when FTP'd to the desktop, opens in a browser allowing interactive heatmaps. A sample image is attached. Note that the heatmap shows no sample sequence identified deeper than taxonomic family, more on this later.
gwatts@futscherlab32gb:~/QIIME_files/PGM$ make_otu_network.py -m PGM.txt -i otus/otu_table.biom -o otus/OTU_Network
#generates files that can be used in Cytoscape to view an OTU network. Calling Cytoscape in the command line with either a lowercase or uppercase spelling failed though:
gwatts@futscherlab32gb:~/QIIME_files/PGM$ Cytoscape
Cytoscape: command not found
gwatts@futscherlab32gb:~/QIIME_files/PGM$ cytoscape
cytoscape: command not found
gwatts@futscherlab32gb:~/QIIME_files/PGM$
#is cytoscape installed?:
gwatts@futscherlab32gb:~/QIIME_files/PGM$ cd
gwatts@futscherlab32gb:~$ find -iname *cytoscape*
./qiime_software/qiime-1.5.0-release/doc/scripts/cytoscape_usage.rst
./qiime_software/cytoscape-2.7.0-release
./qiime_software/cytoscape-2.7.0-release/src/cytoscape
./qiime_software/cytoscape-2.7.0-release/src/cytoscape/CytoscapeVersion.java
./qiime_software/cytoscape-2.7.0-release/src/cytoscape/Cytoscape.java
./qiime_software/cytoscape-2.7.0-release/src/cytoscape/giny/CytoscapeRootGraph.java
./qiime_software/cytoscape-2.7.0-release/src/cytoscape/giny/CytoscapeFingRootGraph.java
....
#there's lots more matches, the point is that cytoscape is installed, but doesn't start simply by typing cytoscape in the command line...ok, moving on….
gwatts@futscherlab32gb:~/QIIME_files/PGM$ summarize_taxa_through_plots.py -i otus/otu_table.biom -o PGM_taxa_summary -m PGM.txt
#This generates a html file and javascript directory that creates graphs of the bacterial diversity across the samples. A sample image is attached. Again I see that the taxonomic level does not get below the level of family....
gwatts@futscherlab32gb:~/QIIME_files/PGM$ echo "alpha_diversity:metrics shannon,PD_whole_tree,chao1,observed_species" > alpha_params.txt
#setting up alpha diversity analysis...
gwatts@futscherlab32gb:~/QIIME_files/PGM$ alpha_rarefaction.py -i otus/otu_table.biom -m PGM.txt -o PGM_arare/ -p alpha_params.txt -t otus/rep_set.tre
#outputs a html file and javascript directory allowing rarefaction curves for the samples. A sample image is attached. We can see that the 4 samples with the fewest reads would benefit from additional reads as they haven't flattened out like the others before running out of reads....
gwatts@futscherlab32gb:~/QIIME_files/PGM$ beta_diversity_through_plots.py -i otus/otu_table.biom -m PGM.txt -o PGM_bdiv_even37900/ -t otus/rep_set.tre -e 37900
#Finally, we get at the experimental question! I used 37,900 reads as the limit, analogous to the tutorial's 146 reads as this is the number of reads in the sample with the fewest reads (#136). The question is: Is there a difference in the bacterial communities between high and low treatment groups? Output is html files and associated directory that starts the king application to graph the samples across the 3 principal coordinates. A Sample 3D plot (beta_diversity_weighted3dplot.jpg) is attached. Ideally, the treatment groups would account for the major portion of variability between the samples and thus represent the PC1 axis. If the samples separate along PC1 as they did in the tutorial test data, it is suggestive of a difference between the groups that can be tested with the jackknife workflow. Unfortunately, my samples do not appear to separate based on treatment. That said, we can see that there is considerable variability between the samples, so we can say that the reason we didn’t get a positive result was not complete uniformity in the bacterial content of the samples.
The negative result could be caused by several possibilities:
Was the V6 hypervariable sequence information-dense enough? If the sequence I used was not capable of classifying bacteria past, say, the kingdom level, then all samples would look the same and there would be no differences. I would answer that the alpha rarefaction, the spread of the samples in space on the beta diversity plot, and the diversity seen in the bacterial diversity graph all argue that I have enough reads in my samples, and there is significant variation in the bacteria present across the samples. It’s just that this variation just does not reflect the treatment group. Would the QIIME experts agree? Nevertheless, the taxonomic assignment never got past the family level. Could this explain the negative result? Sundquist et al. 2007 showed that even the short V6 reads should be capable of assigning a majority of bacteria to the genus level. Did I do something wrong? Checking this forum, I found a thread in which Tony has discussed this issue with another user named George:
######################################################################################################
On Friday, January 27, 2012 11:01:29 AM UTC-7, TonyWalters wrote:
Hello George,
Several factors come into play-how long your reads are, what 16S region they cover (V2 and V4 perform better than V6 in most cases for instance), and how complete the reference dataset is (for example if there were a poorly covered region of the tree of life, with only a single sequence that was quite a bit different than other sequences, then you could get a false positive genus level assignment for a read that was somewhat similar to that sequence). Additionally, the training set itself can be sliced out to include only the region in question (i.e., the section between the 27f and 338r primers, to get rid of the background noise of the full sequence lengths, see
http://www.ncbi.nlm.nih.gov/pubmed/21716311).
The short answer is yes, the RDP classifier can get genus level assignments, but care should be taken with such assignments. If a particular OTU was assigned a genus and was higher in abundance under certain conditions such as when a particular carbon source was added, you would want to find that sequence in the rep set and make sure you get consistent results with some other test, such as blasting on the NCBI site.
Another question that is being asked is why previously the RDP classifier was giving genus level assignments but goes only to family now-the default RDP taxonomy allows genus level assignments, but any time you retrain the RDP classifier it has a strict 5-level requirement for the taxonomy strings, which is why be default, they only go down to family with the Greengenes retraining.
Hope this helps,
Tony Walters
######################################################################################################
######################################################################################################
Hello Walid,
The default taxonomy mapping file for the Greengenes 4 February release only goes to the family level. I've attached a modified mapping file that goes down to the genus level (note that there is a restriction to five levels of taxonomy defined, so in the case of the genus level retraining file, you get phylum/class/order/family/genus, in the original you get domain/phylum/class/order/family).
Best regards,
Tony Walters
######################################################################################################
I re-ran my analysis using the new mapping file (greengenes_tax_rdp_train_genus.txt) making sure to specify this new genus-level mapping file with the -t parameter when running assign_taxonomy.py. The script assign_taxonomy.py is called by pick_otus_through_otu_table.py which calls 7 scripts (including assign_taxonomy.py) automatically. In order to use the new mapping file I needed to specify it in the qiime_paraters.txt file. Being a newbie, I opted to simply run each script called by pick_otus_through_otu_table.py one at a time so I could specify the new mapping file on the command line. This analysis worked, and the new heatmap shows bacteria now being assigned down to the genus level (see attached: genus_heatmap.jpg). But, alas, the beta diversity graph still does not show a difference between the treatment groups (see attached: genus_beta_diversity.jpg). Interestingly, only ~20% of the bacteria are identified to the genus level - lower than I expected based on the Sundquist et al. 2007 paper.
I then tried be less of a newbie and actually run the new genus-level analysis using pick_otus_through_otu_table.py specifying the greengenes_tax_rdp_train_genus.txt file in qiime_parameters.txt:
I made a new directory called Piped_Genus_level and copied in qiime_parameters.txt and edited the taxonomy assignment parameters:
# Taxonomy assignment parameters
assign_taxonomy:id_to_taxonomy_fp /mnt/datas/home/gwatts/QIIME_files/Piped_Genus_level/greengenes_tax_rdp_train_genus.txt
assign_taxonomy:reference_seqs_fp
assign_taxonomy:assignment_method rdp
assign_taxonomy:blast_db
assign_taxonomy:confidence 0.8
assign_taxonomy:e_value 0.001
And my directory now looks like this:
gwatts@futscherlab32gb:~/QIIME_files/Piped_Genus_level$ ls
greengenes_tax_rdp_train_genus.txt PGM.fna PGM.txt qiime_parameters.txt
# So, I have the new genus-level mapping file, the sequence in the .fna, the mapping file in PGM.txt, and lastly, the new qiime_parameters specifying to use the genus-level mapping file in the Taxonomy assignment parameters.
When I run the scripts though, my output does not identify bacteria to the genus level. I am doing something wrong.
My questions to the QIIME experts are:
1) Why is cytoscape "command not found" when I try to run it? The install is there as I showed above.
2) Have I performed a reasonable analysis? In other words, would you agree that the negative result I have is because my treatment groups don't differ in bacterial diversity or is there something I have missed?
3) Would adding V1-V2 sequence reads from the same samples to the analysis have a reasonable chance of changing the result, or is the negative result from the V6 reads pretty much the end of the road? Put another way, is there something else you would try if you got this result, or would you conclude that the experiment had a negative result and move on?
4) What did I do wrong when I tried to run the analysis using a modified qiime_parameters.txt specifying the genus-level mapping file?
Thanks for reading all of this, I am trying to be very clear and complete in describing my analysis.