Ion Torrent PGM 16S rRNA data analysis walk-through: follow-up questions


GeorgeWatts

Oct 2, 2012, 5:40:55 PM10/2/12
to qiime...@googlegroups.com
I am back to report how things went with my QIIME analysis of Ion Torrent PGM 16S rRNA V6 data, along with some follow-up questions. This post is a follow-up to another thread (https://groups.google.com/forum/#!searchin/qiime-forum/PGM/qiime-forum/Tl1EvzXbf9U/r49pYcDbzroJ), but it is off topic to that thread's subject, so I have started this one.

First I had to get a working install of QIIME; after a lot of excellent help, the working install is detailed here: https://groups.google.com/forum/#!searchin/qiime-forum/PGM/qiime-forum/-fD44oQg4wE/4z47iysiGa0J

Once I had QIIME installed, I worked through the tutorial (http://qiime.org/tutorials/tutorial.html) using the tutorial data (ftp://thebeast.colorado.edu/pub/QIIME-v1.5.0-dependencies/qiime_tutorial-v1.5.0.zip). All went well except for viewing the OTU network with Cytoscape (see below), so I proceeded with my PGM data.
 
There are two major components to the analysis pipeline detailed in the tutorial: the first prepares the data for analysis (dealing with barcoded samples, quality filtering, etc.) and the second is the actual analysis. The first component ends with the output of split_libraries.py, which is a FASTA file (.fna), plus a user-generated Mapping file (.txt). The issue with using the QIIME scripts to prepare PGM data in the first component is that they expect the data in 454 format. One approach would be to write a script to convert the PGM quality scores to 454 quality scores; however, as Tony pointed out, there may still be issues with error modeling, and as I found out, my use of the same barcode more than once in the project would also have prevented me from using split_libraries.py successfully. The alternative is to process the PGM data outside of QIIME to create the starting files and enter the QIIME pipeline at the second component, where the actual analysis scripts are run. The starting files needed to enter the workflow at pick_otus_through_otu_table.py are a FASTA file (.fna) and a Mapping file (.txt) describing the samples. To create these files, I performed the following:

                1) We ran five 316 PGM chips, each chip was loaded with 4 barcoded libraries generated from a human bacterial sample. Thus, we have five FASTQ files representing 20 samples. Ten samples were in treatment group "high" and ten were in treatment group "low". The question is whether there are significant differences in the microbial communities between the two treatment groups - analogous to the fasting example in the tutorial.
                2) The FASTQ files were converted to FASTA files (I used NextGENe). The quality information in the FASTQ file was used during the conversion to filter out poor-quality reads (minimum median quality score = 30), to trim reads to remove low-quality ends (trim after the first base with quality <= 20), and to remove short reads (minimum read length = 79). The minimum length of 79 was chosen so that all reads contained all of the V6 hypervariable region targeted by our primers. These quality filters are very stringent; I chose this level of stringency so I would only be working with very high quality sequence, with the added benefit that the number of sequences analyzed was reduced, thus reducing script run times.
                3) The five resulting FASTA files were split based on barcode to produce 20 FASTA files, one for each sample, and the sequences in the FASTA files were stripped of the barcode sequence as well as the primer sequence on both ends. The resulting 20 FASTA files now have sequence that is ready for analysis, but the headers are not what the QIIME scripts need, and they need to be combined into a single file.
                4) I combined the 20 FASTA files into a single file and re-formatted the headers to match the output format of QIIME's split_libraries.py using a Perl script (if you want the script, email me).
                5) I made the Mapping file by hand in a text editor following the example provided in the tutorial data.
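For anyone curious how the quality filters in step 2 work, they are easy to sketch in Python. This is an illustrative sketch only (the function name and record format are my own, not the NextGENe implementation), assuming Phred+33 quality strings:

```python
import statistics

def filter_fastq_records(records, min_median_q=30, trim_q=20, min_len=79):
    """Quality filters like those described in step 2 (a sketch, not the
    NextGENe implementation). `records` is an iterable of
    (header, sequence, quality_string) tuples, Phred+33 encoded."""
    for header, seq, qual in records:
        scores = [ord(c) - 33 for c in qual]
        # Discard reads whose median quality is below the cutoff.
        if statistics.median(scores) < min_median_q:
            continue
        # Trim the read at the first base with quality <= trim_q.
        cut = len(seq)
        for i, q in enumerate(scores):
            if q <= trim_q:
                cut = i
                break
        seq = seq[:cut]
        # Discard reads shorter than the minimum length after trimming.
        if len(seq) < min_len:
            continue
        yield header, seq
```

A read with uniformly high quality passes untouched, while a read with one low-quality base early on gets trimmed there and then discarded if the remainder is shorter than 79 bases.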
 
After the 5 steps above, I had two files: PGM.fna and PGM.txt. The first few sequences of PGM.fna look like:
                >112.BC1_1 RXB0T:48:298 orig_bc=ACTCA new_bc=ACTCA bc_diffs=0
                CACCACCTGTCTACCGGTTCCCGAAGGCACAGTCATACTTCTATGACCTCCCGGAGATGTCAAGGTCTGGTA
                >112.BC1_2 RXB0T:49:264 orig_bc=ACTCA new_bc=ACTCA bc_diffs=0
                CAGCACCTGTGTTACGGTTCCCGAAGGCACTCCTCCGTCTCTGGAGGATTCCGTACATGTCAAGACC
                >112.BC1_3 RXB0T:32:335 orig_bc=ACTCA new_bc=ACTCA bc_diffs=0
                ACCACCTGCACACGACCAACTAAATGCCACCACATCTCTGCAGTGTCGCCGTGCATGTCAAGCCT
                >112.BC1_4 RXB0T:40:344 orig_bc=ACTCA new_bc=ACTCA bc_diffs=0
                CACCACCTGTTTTCTGGCTTCCGAAGAAGAGGAACTATCTCTAGTTCTGTCCATCAATGTCAAGACC
                >112.BC1_5 RXB0T:44:307 orig_bc=ACTCA new_bc=ACTCA bc_diffs=0
                CAGCACCTGTCTCATGGTTCCCGAAGGCACTCCTCCGTCTCTGGAGGATTCCGTACATGTCA
 
Where 112.BC1 is the sampleID.barcode followed by _sequencenumber (starting at 1 and counting up). The remaining parts of the header are not required, but they were easy to code into the script and result in headers that exactly mimic the output of split_libraries.py, so I put them in just in case I was wrong about them not being required.
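The header re-formatting in step 4 can be sketched in Python for readers who don't want to email for the Perl script. This is a hypothetical stand-in (function name and details are mine, not the actual script, which also handles barcode/primer stripping and combining files first):

```python
def relabel_fasta(lines, sample_id):
    """Rewrite FASTA headers into split_libraries.py-style labels:
    '>sampleID_N original_header', numbering reads from 1, matching the
    '112.BC1_1' example above. Sequence lines pass through unchanged."""
    count = 0
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            count += 1
            yield ">%s_%d %s" % (sample_id, count, line[1:])
        else:
            yield line
```

Run per sample file, then concatenate the outputs into one .fna.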
 
The Mapping file, in its entirety, is:
 
#SampleID       BarcodeSequence LinkerPrimerSequence    Treatment       DOB     Description
#mapping file for PGM.fna for the QIIME analysis package.
112.BC1 ACTCA   YATGCTGCCTCCCGTAGGAGT   low     20120910        112.BC1
113.BC1 ACTCA   YATGCTGCCTCCCGTAGGAGT   low     20120910        113.BC1
119.BC6 AGCCA   YATGCTGCCTCCCGTAGGAGT   low     20120910        119.BC6
127.BC2 CGTGT   YATGCTGCCTCCCGTAGGAGT   low     20120910        127.BC2
129.BC7 GACAT   YATGCTGCCTCCCGTAGGAGT   low     20120910        129.BC7
130.BC3 GATGA   YATGCTGCCTCCCGTAGGAGT   low     20120910        130.BC3
131.BC4 CGATA   YATGCTGCCTCCCGTAGGAGT   low     20120910        131.BC4
132.BC1 ACTCA   YATGCTGCCTCCCGTAGGAGT   low     20120910        132.BC1
138.BC5 CTTAC   YATGCTGCCTCCCGTAGGAGT   high    20120910        138.BC5
139.BC8 GCCTT   YATGCTGCCTCCCGTAGGAGT   low     20120910        139.BC8
143.BC1 ACTCA   YATGCTGCCTCCCGTAGGAGT   low     20120910        143.BC1
150.BC2 CGTGT   YATGCTGCCTCCCGTAGGAGT   high    20120910        150.BC2
151.BC1 ACTCA   YATGCTGCCTCCCGTAGGAGT   high    20120910        151.BC1
156.BC2 CGTGT   YATGCTGCCTCCCGTAGGAGT   high    20120910        156.BC2
162.BC3 GATGA   YATGCTGCCTCCCGTAGGAGT   high    20120910        162.BC3
165.BC6 AGCCA   YATGCTGCCTCCCGTAGGAGT   high    20120910        165.BC6
166.BC4 CGATA   YATGCTGCCTCCCGTAGGAGT   high    20120910        166.BC4
177.BC3 GATGA   YATGCTGCCTCCCGTAGGAGT   high    20120910        177.BC3
185.BC4 CGATA   YATGCTGCCTCCCGTAGGAGT   high    20120910        185.BC4
208.BC7 GACAT   YATGCTGCCTCCCGTAGGAGT   high    20120910        208.BC7
 
Note that we re-used barcodes between chips. In retrospect, this would have been a serious problem if we had tried to use the split_libraries.py script, which expects a unique barcode for each sample. The LinkerPrimerSequence is not in our sequence, but it was in the tutorial Mapping file, and again, in the interest of mimicking the functioning test files as much as possible, I left it in my Mapping file. I do not think it is used by any of the scripts downstream of pick_otus_through_otu_table.py and is therefore not necessary. I think the same is true of the BarcodeSequence, DOB, and Description columns, but rather than finding out I was wrong the hard way, I kept them in the file.
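Since barcode re-use is exactly what rules out split_libraries.py here, a quick sanity check on a mapping file can flag it up front. A minimal sketch (my own helper, not a QIIME script), assuming the tab-delimited layout shown above:

```python
from collections import defaultdict

def duplicate_barcodes(mapping_lines):
    """Report barcodes shared by more than one SampleID in a QIIME
    mapping file (tab-delimited; '#'-prefixed lines are header/comments).
    Any hits mean split_libraries.py cannot demultiplex the data."""
    by_barcode = defaultdict(list)
    for line in mapping_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip header and comment lines
        fields = line.split("\t")
        sample_id, barcode = fields[0], fields[1]
        by_barcode[barcode].append(sample_id)
    return {bc: ids for bc, ids in by_barcode.items() if len(ids) > 1}
```

Run against the mapping file above, this would report ACTCA shared by five samples, CGTGT by three, and so on.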
 
So, I have the files I need to enter the QIIME pipeline at pick_otus_through_otu_table.py. Below I am showing the actual command line as I proceed through the scripts in the tutorial. My comments are indicated with #.
 
gwatts@futscherlab32gb:~/QIIME_files$ cd PGM
gwatts@futscherlab32gb:~/QIIME_files/PGM$ ls
PGM.fna  PGM.txt
# The starting files.
 
gwatts@futscherlab32gb:~/QIIME_files/PGM$ pick_otus_through_otu_table.py -i PGM.fna -o otus/
# Picks the OTUs to be used in the downstream analyses and saves results to directory “otus”
 
gwatts@futscherlab32gb:~/QIIME_files/PGM$ per_library_stats.py -i otus/otu_table.biom
Num samples: 20
Num otus: 3968
Num observations (sequences): 3920441.0
 
Seqs/sample summary:
 Min: 37935.0
 Max: 508787.0
 Median: 186429.0
 Mean: 196022.05
 Std. dev.: 130457.80878
 Median Absolute Deviation: 100658.0
 Default even sampling depth in
  core_qiime_analyses.py (just a suggestion): 37935.0
 
Seqs/sample detail:
 132.BC1: 37935.0
 177.BC3: 40072.0
 150.BC2: 43613.0
 185.BC4: 50611.0
 143.BC1: 85096.0
 130.BC3: 86446.0
 112.BC1: 93455.0
 138.BC5: 103118.0
 131.BC4: 144820.0
 139.BC8: 154006.0
 166.BC4: 218852.0
 208.BC7: 236403.0
 129.BC7: 237947.0
 162.BC3: 250536.0
 113.BC1: 285412.0
 165.BC6: 304125.0
 127.BC2: 305059.0
 119.BC6: 321998.0
 156.BC2: 412150.0
 151.BC1: 508787.0
 
# Checking the results of pick_otus_through_otu_table.py, we see there are 3968 OTUs. There are 4 samples (132, 177, 150, 185) with significantly fewer reads than all the rest - these four were from the same chip and reflect a mediocre sequencing run due to low templated bead yield from emulsion PCR. The sample with the fewest sequences is 132, with 37,935 - this will come into play later when we look at the beta diversity, which requires us to specify the number of sequences to use....
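per_library_stats.py computes these counts from the OTU table, but the same per-sample totals can be taken directly from the combined FASTA headers, which is handy for choosing the even sampling depth before OTU picking. A sketch (my own helper, not a QIIME script), assuming split_libraries.py-style '>SampleID_readnumber' headers:

```python
from collections import Counter

def seqs_per_sample(fasta_lines):
    """Count reads per sample in a QIIME-labelled FASTA, where each
    header looks like '>SampleID_readnumber ...'. The smallest count is
    a natural ceiling for the even sampling depth (-e) used later by
    beta_diversity_through_plots.py."""
    counts = Counter()
    for line in fasta_lines:
        if line.startswith(">"):
            label = line[1:].split()[0]        # e.g. '112.BC1_1'
            sample = label.rsplit("_", 1)[0]   # strip the read number
            counts[sample] += 1
    return counts
```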

gwatts@futscherlab32gb:~/QIIME_files/PGM$ make_otu_heatmap_html.py -i otus/otu_table.biom -o otus/OTU_Heatmap/
# Makes a heatmap-generating html file and supporting javascript directory that, when FTP'd to the desktop, opens in a browser allowing interactive heatmaps. A sample image is attached. Note that the heatmap shows no sample sequence identified deeper than taxonomic family, more on this later.
 
gwatts@futscherlab32gb:~/QIIME_files/PGM$ make_otu_network.py -m PGM.txt -i otus/otu_table.biom -o otus/OTU_Network
# Generates files that can be used in Cytoscape to view an OTU network. Calling Cytoscape on the command line with either a lowercase or uppercase spelling failed, though:
 
gwatts@futscherlab32gb:~/QIIME_files/PGM$ Cytoscape
Cytoscape: command not found
gwatts@futscherlab32gb:~/QIIME_files/PGM$ cytoscape
cytoscape: command not found
gwatts@futscherlab32gb:~/QIIME_files/PGM$
 
#is cytoscape installed?:
gwatts@futscherlab32gb:~/QIIME_files/PGM$ cd
gwatts@futscherlab32gb:~$ find -iname *cytoscape*
./qiime_software/qiime-1.5.0-release/doc/scripts/cytoscape_usage.rst
./qiime_software/cytoscape-2.7.0-release
./qiime_software/cytoscape-2.7.0-release/src/cytoscape
./qiime_software/cytoscape-2.7.0-release/src/cytoscape/CytoscapeVersion.java
./qiime_software/cytoscape-2.7.0-release/src/cytoscape/Cytoscape.java
./qiime_software/cytoscape-2.7.0-release/src/cytoscape/giny/CytoscapeRootGraph.java
./qiime_software/cytoscape-2.7.0-release/src/cytoscape/giny/CytoscapeFingRootGraph.java
....
# There are lots more matches; the point is that Cytoscape is installed but doesn't start simply by typing cytoscape on the command line... OK, moving on….
 
gwatts@futscherlab32gb:~/QIIME_files/PGM$ summarize_taxa_through_plots.py -i otus/otu_table.biom -o PGM_taxa_summary -m PGM.txt
# This generates an html file and javascript directory that creates graphs of the bacterial diversity across the samples. A sample image is attached. Again I see that the taxonomic assignment does not go below the level of family....
 
gwatts@futscherlab32gb:~/QIIME_files/PGM$ echo "alpha_diversity:metrics shannon,PD_whole_tree,chao1,observed_species" > alpha_params.txt
#setting up alpha diversity analysis...
 
gwatts@futscherlab32gb:~/QIIME_files/PGM$ alpha_rarefaction.py -i otus/otu_table.biom -m PGM.txt -o PGM_arare/ -p alpha_params.txt -t otus/rep_set.tre
# Outputs an html file and javascript directory allowing rarefaction curves for the samples to be viewed. A sample image is attached. We can see that the 4 samples with the fewest reads would benefit from additional reads, as their curves haven't flattened out like the others before running out of reads....
 
gwatts@futscherlab32gb:~/QIIME_files/PGM$ beta_diversity_through_plots.py -i otus/otu_table.biom -m PGM.txt -o PGM_bdiv_even37900/ -t otus/rep_set.tre -e 37900
# Finally, we get at the experimental question! I used 37,900 reads as the even sampling depth, analogous to the tutorial's 146 reads, as this is just below the number of reads in the sample with the fewest reads (#132, with 37,935). The question is: is there a difference in the bacterial communities between the high and low treatment groups? Output is html files and an associated directory that starts the KiNG application to graph the samples across the 3 principal coordinates. A sample 3D plot (beta_diversity_weighted3dplot.jpg) is attached. Ideally, the treatment groups would account for the major portion of variability between the samples and thus fall along the PC1 axis. If the samples separate along PC1 as they did in the tutorial test data, it is suggestive of a difference between the groups that can be tested with the jackknife workflow. Unfortunately, my samples do not appear to separate based on treatment. That said, we can see that there is considerable variability between the samples, so the negative result cannot be blamed on complete uniformity in the bacterial content of the samples.

The negative result could be caused by several possibilities:
Was the V6 hypervariable sequence information-dense enough? If the sequence I used was not capable of classifying bacteria past, say, the kingdom level, then all samples would look the same and there would be no differences. I would answer that the alpha rarefaction, the spread of the samples in the beta diversity plot, and the diversity seen in the bacterial diversity graph all argue that I have enough reads in my samples and that there is significant variation in the bacteria present across the samples; it's just that this variation does not reflect the treatment group. Would the QIIME experts agree? Nevertheless, the taxonomic assignment never got past the family level. Could this explain the negative result? Sundquist et al. 2007 showed that even short V6 reads should be capable of assigning a majority of bacteria to the genus level. Did I do something wrong? Checking this forum, I found a thread in which Tony discussed this issue with another user named George:
 
######################################################################################################
On Friday, January 27, 2012 11:01:29 AM UTC-7, TonyWalters wrote:
Hello George,
 
This is a tricky question, because the answer really depends upon multiple factors.  The RDP classifier can get genus level assignments (see http://www.ncbi.nlm.nih.gov/pubmed/17586664?dopt=AbstractPlus) but this was with 400 base pair reads.
Several factors come into play-how long your reads are, what 16S region they cover (V2 and V4 perform better than V6 in most cases for instance), and how complete the reference dataset is (for example if there were a poorly covered region of the tree of life, with only a single sequence that was quite a bit different than other sequences, then you could get a false positive genus level assignment for a read that was somewhat similar to that sequence).  Additionally, the training set itself can be sliced out to include only the region in question (i.e., the section between the 27f and 338r primers, to get rid of the background noise of the full sequence lengths, see http://www.ncbi.nlm.nih.gov/pubmed/21716311).
 
The short answer is yes, the RDP classifier can get genus level assignments, but care should be taken with such assignments.  If a particular OTU was assigned a genus and was higher in abundance under certain conditions such as when a particular carbon source was added, you would want to find that sequence in the rep set and make sure you get consistent results with some other test, such as blasting on the NCBI site.
 
Another question that is being asked is why the RDP classifier previously gave genus-level assignments but now goes only to family - the default RDP taxonomy allows genus-level assignments, but any time you retrain the RDP classifier it has a strict 5-level requirement for the taxonomy strings, which is why, by default, assignments only go down to family with the Greengenes retraining.
 
Hope this helps,
Tony Walters
######################################################################################################
 
In another thread (https://groups.google.com/forum/?fromgroups=#!topic/qiime-forum/ISXDpbC45xg) Tony posted a Greengenes reference file that can be used to assign reads to the genus level:
 
######################################################################################################
Hello Walid,
 
The default taxonomy mapping file for the Greengenes 4 February release only goes to the family level.  I've attached a modified mapping file that goes down to the genus level (note that there is a restriction to five levels of taxonomy defined, so in the case of the genus level retraining file, you get phylum/class/order/family/genus, in the original you get domain/phylum/class/order/family).
 
Best regards,
Tony Walters
######################################################################################################
 
I re-ran my analysis using the new mapping file (greengenes_tax_rdp_train_genus.txt), making sure to specify this new genus-level mapping file with the -t parameter when running assign_taxonomy.py. The script assign_taxonomy.py is one of the 7 scripts called automatically by pick_otus_through_otu_table.py. In order to use the new mapping file with that workflow script, I would have needed to specify it in the qiime_parameters.txt file. Being a newbie, I opted to simply run each script called by pick_otus_through_otu_table.py one at a time so I could specify the new mapping file on the command line. This analysis worked, and the new heatmap shows bacteria now being assigned down to the genus level (see attached: genus_heatmap.jpg). But, alas, the beta diversity graph still does not show a difference between the treatment groups (see attached: genus_beta_diversity.jpg). Interestingly, only ~20% of the bacteria are identified to the genus level - lower than I expected based on the Sundquist et al. 2007 paper.

I then tried to be less of a newbie and actually run the new genus-level analysis using pick_otus_through_otu_table.py, specifying the greengenes_tax_rdp_train_genus.txt file in qiime_parameters.txt:
I made a new directory called Piped_Genus_level, copied in qiime_parameters.txt, and edited the taxonomy assignment parameters:

# Taxonomy assignment parameters
assign_taxonomy:id_to_taxonomy_fp /mnt/datas/home/gwatts/QIIME_files/Piped_Genus_level/greengenes_tax_rdp_train_genus.txt
assign_taxonomy:reference_seqs_fp
assign_taxonomy:assignment_method       rdp
assign_taxonomy:blast_db
assign_taxonomy:confidence      0.8
assign_taxonomy:e_value 0.001

And my directory now looks like this:
gwatts@futscherlab32gb:~/QIIME_files/Piped_Genus_level$ ls
greengenes_tax_rdp_train_genus.txt  PGM.fna  PGM.txt  qiime_parameters.txt
# So, I have the new genus-level mapping file, the sequence in the .fna, the mapping file in PGM.txt, and lastly, the new qiime_parameters specifying to use the genus-level mapping file in the Taxonomy assignment parameters.
When I run the scripts though, my output does not identify bacteria to the genus level. I am doing something wrong.
 
My questions to the QIIME experts are:
 
1) Why is cytoscape "command not found" when I try to run it? The install is there as I showed above.
 
2) Have I performed a reasonable analysis? In other words, would you agree that the negative result I have is because my treatment groups don't differ in bacterial diversity or is there something I have missed?
 
3) Would adding V1-V2 sequence reads from the same samples to the analysis have a reasonable chance of changing the result, or is the negative result from the V6 reads pretty much the end of the road? Put another way, is there something else you would try if you got this result, or would you conclude that the experiment had a negative result and move on?

4) What did I do wrong when I tried to run the analysis using a modified qiime_parameters.txt specifying the genus-level mapping file?
Thanks for reading all of this; I am trying to be very clear and complete in describing my analysis.

alpha_rarefaction.jpg
beta_diversity_weighted3dplot.jpg
genus_beta_diversity.jpg
genus_heatmap.jpg
heatmap.jpg
taxa_summary_areachart.jpg

Matt Wade

Oct 3, 2012, 6:03:25 AM10/3/12
to qiime...@googlegroups.com
Thanks George for a very thorough breakdown of running QIIME with IonTorrent data. This will be useful to others beginning to work with PGM rather than 454. There is an analogous thread here.

Regarding your questions I might let the QIIME experts handle those, but a few thoughts:

 
1) Why is cytoscape "command not found" when I try to run it? The install is there as I showed above.

If I remember correctly, Cytoscape can be run outside of the shell (at least on the Mac) by simply opening the application and then loading the QIIME-produced file(s) using the GUI menu.
 
 
2) Have I performed a reasonable analysis? In other words, would you agree that the negative result I have is because my treatment groups don't differ in bacterial diversity or is there something I have missed?

I do see (from the plots) some separation between the two groups in the genus PCoA plot - the separation appears to be along the second principal coordinate. In Principal Component Analysis, you could take a look at the loadings plot to see which factors/variables contribute to which component (e.g. Treatment) and make some judgement as to what may be causing the discrimination along PC2 (it may be Treatment after all, although you would hypothesise not). In Principal Coordinate Analysis, I don't think loadings are obtainable, as it is an MDS technique based on the distance matrix. I might be wrong.

Regards,
Matt
 


LauraWP

Oct 3, 2012, 11:26:35 AM10/3/12
to qiime...@googlegroups.com
Hi George,

3) Would adding V1-V2 sequence reads from the same samples to the analysis have a reasonable chance of changing the result, or is the negative result from the V6 reads pretty much the end of the road? Put another way, is there something else you would try if you got this result, or would you conclude that the experiment had a negative result and move on?

At the initial stages of the analysis it is important to look at your data in many different ways to get a sense of the patterns that are in there, so no, I would not give up yet. Several things to try:
Unweighted UniFrac often leads to patterns that are clearer.

compare_categories.py - this has a bunch of statistical tests to determine whether there is a significant difference between treatments even though it is not obvious in the PCoA plot.
OTU category significance - are there taxa or OTUs that differ across treatments? You can do this on the OTU table and also at higher taxonomic levels (use summarize_taxa.py, and then you will need to convert the results back to .biom files with convert_biom.py using the options --otu_table_type="otu table" --header_key=taxonomy).
supervised_learning.py also allows you to assess differences among groups.


If you currently have V1-V2 data I would absolutely recommend looking at it as well.
Note that changing the taxonomy assignment to genus level will not impact your beta diversity results.

4) What did I do wrong when I tried to run the analysis using a modified qiime_parameters.txt specifying the genus-level mapping file?

First, a really simple question: did you specify the new parameters file in your pick_otus_through_otu_table.py command with -p? It looks like what you did should work.

Best,
Laura

Udeshika

Apr 5, 2013, 12:29:56 AM4/5/13
to qiime...@googlegroups.com
Hi George,
Can you please send me the Perl scripts that you used to convert PGM data to QIIME format? The FASTQ files I got don't have the barcode and primer sequences, so I have to insert the barcodes into the reads from the different samples. Can you please send me your Perl commands?

Udeshika

David Visi

May 24, 2013, 11:38:08 PM5/24/13
to qiime...@googlegroups.com
I was wondering if I could obtain the Perl script.

Helen Mitchell

Jul 25, 2013, 9:17:13 PM7/25/13
to qiime...@googlegroups.com
Dear George,
Thank you so much for this post. Clearly it is proving invaluable for lots of Ion Torrent Users (myself included). Would I also be able to obtain your python scripts for renaming the files please?
Thanks,
Helen



Helen Mitchell

Jul 25, 2013, 9:21:25 PM7/25/13
to qiime...@googlegroups.com
Sorry, that should be Perl scripts (not Python).

SAMIK

Nov 18, 2013, 9:48:15 AM11/18/13
to qiime...@googlegroups.com
hi George, 

Thank you for the nice description. I also have 10 separate FASTA files; can you please provide the Perl script so that I can merge them and make them compatible with QIIME?

Kind Regards
Samik

Gregg Iceton

Nov 20, 2013, 12:08:49 PM11/20/13
to qiime...@googlegroups.com
Things have moved on since then.  To use QIIME with Ion Torrent data simply create a FASTQ file within the Torrent Suite.  You can then work on this directly in QIIME - the quality encoding is a standard format (phred+33).  It is only the SFF file that Ion Torrent produced which was incompatible with QIIME.

Nastassia Patin

Jan 9, 2014, 3:29:34 PM1/9/14
to qiime...@googlegroups.com
Hey Gregg, can you elaborate a little more on this? I have some FASTQ files that I would like to analyze with QIIME, and I assumed I would have to create two new files a la 454 style (a FASTA file and a map file). How would you do it with the FASTQ file alone?

Thanks,
Nastassia

SAMIK BAGCHI

Jan 9, 2014, 4:43:00 PM1/9/14
to qiime...@googlegroups.com
Hi Nastassia, 

I believe you are using QIIME 1.7. You can use the convert_fastaqual_fastq.py script, which can convert your FASTQ file to QIIME-format FASTA and qual files.





--

Kind Regards

 
Samik Bagchi
Post-Doctoral Fellow
Water Desalination and Reuse Center (WDRC)
Building 4, Level 4, 4266-WS03
King Abdullah University of Science and Technology (KAUST)
Thuwal. 23955-6900 Kingdom of Saudi Arabia

Gregg Iceton

Jan 10, 2014, 10:34:52 AM1/10/14
to qiime...@googlegroups.com
Samik is correct :)

Nastassia Patin

Jan 10, 2014, 12:40:00 PM1/10/14
to qiime...@googlegroups.com
Thanks very much to both of you! That will certainly make things easier. Much appreciated.

-Nastassia

George Watts

Mar 31, 2015, 3:21:05 PM3/31/15
to qiime...@googlegroups.com
To those who've asked for my script on this thread, I apologize for the lack of response; I haven't checked the thread in a long time and didn't have notifications turned on (I do now). I'm posting to clarify SAMIK and Gregg's posts above. QIIME has been updated quite a bit since I last posted, and indeed one no longer needs any scripts outside of QIIME to get Ion Torrent data working in QIIME. In particular, two scripts have been added that can get you from an Ion Torrent FASTQ file to a QIIME-ready FASTA file. Here's a walk-through:
1) Following a run on the Ion Torrent PGM or Proton, one can download barcode-separated FASTQ files, one for each sample. Let's suppose we have two samples, 17 and 18, and the FASTQ files we downloaded are 17.fastq and 18.fastq.

2)First we need to convert to FASTA files using:
$ convert_fastaqual_fastq.py -c fastq_to_fastaqual -f 18.fastq -o FASTAs
$ convert_fastaqual_fastq.py -c fastq_to_fastaqual -f 17.fastq -o FASTAs
Note that we're outputting to a new directory called FASTAs, which yields:
$ ls FASTAs/
17.fna  17.qual  18.fna  18.qual
We've converted FASTQ to FASTA (.fna) and .qual files. However, the headers in the FASTA files are not what QIIME needs; note that there is no sample name or read number in the headers of the 18.fna file:
$ head FASTAs/18.fna 
>8N0Y9:00034:00047
ATTAGATACCCTGGTAGTCCACGCTGTAAACGATGTCGATTGGGTGAGTTTG
>8N0Y9:00042:00026
ATTAGATACCCTGGTAGTCCACGCCGTAAACGATGTCGACTAGCCGTTGGATCCTTGAGATCTTAGTGGCGCAGCTAACGA
>8N0Y9:00044:00048
ACGAGCTGACGACAGCCATGCAGCACCTGTCTCAGAGTTCCGAAGGCAC
>8N0Y9:00048:00030
ACGAGCTGACGACAGCCATGCAGCACCTGTTCA
>8N0Y9:00049:00026
ATTAGATACCCTGGTAGTCCACGCCGTAAACGATGTCGACTAGCCGTTGGGATCCTTGAGATCTTAGTGGCGCAGCTAACGCATAAGTCCC

When what we need is something like this:

>18.V1V2_0 8N0Y9:00034:00047
ATTAGATACCCTGGTAGTCCACGCTGTAAACGATGTCGATTGGGTGAGTTTG
>18.V1V2_1 8N0Y9:00042:00026
ATTAGATACCCTGGTAGTCCACGCCGTAAACGATGTCGACTAGCCGTTGGATCCTTGAGATCTTAGTGGCGCAGCTAACGA
>18.V1V2_2 8N0Y9:00044:00048
ACGAGCTGACGACAGCCATGCAGCACCTGTCTCAGAGTTCCGAAGGCAC
>18.V1V2_3 8N0Y9:00048:00030
ACGAGCTGACGACAGCCATGCAGCACCTGTTCA
>18.V1V2_4 8N0Y9:00049:00026
ATTAGATACCCTGGTAGTCCACGCCGTAAACGATGTCGACTAGCCGTTGGGATCCTTGAGATCTTAGTGGCGCAGCTAACGCATAAGTCCC

This is solved in the next script.

3) We need to add QIIME-compatible headers to the FASTA files and combine them into a single FNA (=FASTA) file for use with QIIME. This can be done using the second script: add_qiime_labels.py. This script requires a Mapping text file with columns telling QIIME the sample name to add to the headers and the name of the FASTA file to add the sample names to. In addition, while the script does not need the information, it will throw an error if several additional columns expected in a Mapping.txt file are missing (see the QIIME tutorial for more details on required columns). Here is a working Mapping file for the two FASTA files we're working with:
$ more ULCERS.Mapping.txt 
#SampleID BarcodeSequence LinkerPrimerSequence Description FASTAfilename
17.V1V2 ACTCA YATGCTGCCTCCCGTAGGAGT DA-1 17.fna
18.V1V2 ACTCA YATGCTGCCTCCCGTAGGAGT DA-2UC 18.fna
Note that we have a header line that starts with a "#" and 5 columns total. The barcode and linker sequences are not of importance here; I just used the ones from the sample Mapping file used in the QIIME tutorial, but you must have these columns. (Side note: it's probably worthwhile to actually put your correct primer sequences into the Mapping file, since you'll need them to run other QIIME scripts that can strip out primer sequences.) Similarly, the Description can be what you want; while the descriptions are not needed, you do need a Description column in the Mapping file. Lastly, we have the column that tells QIIME the FASTA file name to which we will add the SampleID. Now that we have a Mapping file we can run the script:
$ add_qiime_labels.py -i FASTAs/ -m ULCERS.Mapping.txt -c FASTAfilename -o combined_fasta
Note that we're giving the script our FASTAs directory created above as input, the Mapping file, the column name in the Mapping file ("FASTAfilename"), and a name for the output directory, which now contains:
$ ls combined_fasta/combined_seqs.fna
and this file now has all the reads from both FNA files with QIIME-compatible headers:
$ head combined_fasta/combined_seqs.fna 
>18.V1V2_0 8N0Y9:00034:00047
ATTAGATACCCTGGTAGTCCACGCTGTAAACGATGTCGATTGGGTGAGTTTG
>18.V1V2_1 8N0Y9:00042:00026
ATTAGATACCCTGGTAGTCCACGCCGTAAACGATGTCGACTAGCCGTTGGATCCTTGAGATCTTAGTGGCGCAGCTAACGA
>18.V1V2_2 8N0Y9:00044:00048
ACGAGCTGACGACAGCCATGCAGCACCTGTCTCAGAGTTCCGAAGGCAC
>18.V1V2_3 8N0Y9:00048:00030
ACGAGCTGACGACAGCCATGCAGCACCTGTTCA
>18.V1V2_4 8N0Y9:00049:00026
ATTAGATACCCTGGTAGTCCACGCCGTAAACGATGTCGACTAGCCGTTGGGATCCTTGAGATCTTAGTGGCGCAGCTAACGCATAAGTCCC
This new file is now ready for OTU-picking scripts such as pick_otus.py or pick_open_reference_otus.py.
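For anyone curious what add_qiime_labels.py is doing under the hood, the relabelling can be sketched in a few lines of Python. This is an illustrative stand-in (my own function, not the QIIME implementation), assuming per-sample FASTA line lists keyed by SampleID:

```python
def combine_with_labels(samples):
    """Sketch of the relabelling that add_qiime_labels.py performs:
    given {sample_id: fasta_lines}, emit one combined stream where each
    header becomes '>sampleID_N original_header', numbering each
    sample's reads from 0 (matching the combined_seqs.fna shown above)."""
    for sample_id, lines in samples.items():
        n = 0
        for line in lines:
            line = line.rstrip("\n")
            if line.startswith(">"):
                yield ">%s_%d %s" % (sample_id, n, line[1:])
                n += 1
            else:
                yield line
```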

Finally, a word of caution: while the reads in FASTQ files available from the Ion Torrent sequencer have had their barcodes removed, the primer sequences you used to generate the 16S amplicons are still present and could affect your results. You can set up your sequencing server to strip the primers from the reads for you, or use additional scripts in QIIME to remove primer sequence (extract_barcodes.py), and then utilize the quality and length filtering built into split_libraries.py.

Gregg Iceton

Apr 1, 2015, 5:05:20 AM4/1/15
to qiime...@googlegroups.com
Just to add my two penneth, for me there's an easier way to do this.  When setting up the run in Torrent Suite, don't indicate that barcodes have been used.  This will produce a single FASTQ file with non-demultiplexed data.  Then convert that FASTQ to FASTA and QUAL files as George described, and then run split_libraries.py using a mapping file containing the barcodes and the sequencing primer.  That way you get the primer removed as part of the process and have fewer commands to enter.

George Watts

Apr 1, 2015, 3:29:08 PM4/1/15
to qiime...@googlegroups.com
That is definitely a clean approach, Gregg; the only drawback is that you can't see barcode-specific read counts and read-length histograms in the run report, which would give you the confidence to go into QIIME knowing you had a "good" and "even" run.

Gregg Iceton

Apr 9, 2015, 4:22:20 AM4/9/15
to qiime...@googlegroups.com
Agreed - I had to reanalyse a dataset on Torrent Suite just recently for that very reason. I would imagine the most time-efficient and accurate method (due to primer removal) would be to have Torrent Suite run the analysis without barcodes initially, download the FASTQ and start the conversion / split libraries process, and whilst that's happening have Torrent Suite reanalyse with the barcodes set. That way you have the best of both worlds :)