Ion Torrent, fastq and bam files


Sergio Balzano

Jan 28, 2014, 8:59:40 PM
to qiime...@googlegroups.com
Dear all,
My current issue concerns some Ion Torrent data I'm working with and is similar to the one described in this topic (https://groups.google.com/forum/#!searchin/qiime-forum/ion$20torrent$20bam/qiime-forum/Tl1EvzXbf9U/3d5t6BioVHIJ).
 
I received my Ion Torrent data as a .bam and a .fastq file. According to the run report, bad-quality sequences (low quality and polyclonal) had already been removed, so I (probably wrongly) assumed that my data were already clean and ready to use.
 
1) I discarded all reads that were too short, converted my fastq file to fasta, and split the fasta file into samples according to the barcodes, all using Biopython.
 
2) After that I started using QIIME (I didn't know I could do some of the previous steps with QIIME) to remove chimeras, cluster OTUs and assign taxonomy.
 
3) I noticed that the taxonomy results are quite poor, with too many reads (up to 5%) having no BLAST hit, where 'no blast hit' means that the closest match is < 90% similar to my sequence by BLAST. Given that I'm working with 16S/18S data, it seems odd to have so many unassigned reads. Even when I manually BLAST some randomly picked unassigned reads against GenBank, they match known sequences very poorly (< 90%). So I thought that my data needed some further cleaning.
 
4) I went back to the fasta and qual files (obtained from my initial fastq file using the SeqIO.convert command from Biopython) and analysed my .qual file using quality_scores_plot.py. From the output PDF it looks as though, although my data are not very clean, the quality score for most bases is > 20, or even > 25. However, there is a high standard deviation and some data definitely need to be removed.
 
5) I went back to the same fasta and qual files and used split_libraries.py to clean and split them. However, even with a low Phred threshold (20), fewer than 5% of my reads pass the filtering step. Below is the command line I used:
 
$ split_libraries.py -m my_mapping_file.txt -f all_data.fasta -q all_data.qual -l 200 -L 470 -t -s 20 -k -H 8 -M 3 -b 11 -o my_output_directory -w 50 -d
 
The only doubt I have is about the -b option: my barcodes are 11 bp long, so I think I set it correctly.
 
6) Then, reading the QIIME forum and other forums, I noticed that fastq files are not uniquely defined and that those derived from Ion Torrent differ from those derived from 454. Therefore even the conversion from fastq to qual (which I did in Biopython) might be wrong. The conversion from fastq to fasta should not be wrong: it is just a matter of replacing "@" with ">" on the header line and removing the "+" line and the quality score line.
 
7) I then read that QIIME does not support Ion Torrent files and that the results from both quality_scores_plot.py and split_libraries.py might therefore be wrong.
 
8) I would like to hear from other qiime users who worked with ion torrent data processed with Torrent Suite 3.6 or newer (where they started using bam files instead of sff).
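
Regarding points 6 and 7, one way to check which quality encoding a fastq file uses is to look at the range of ASCII codes on its quality lines. The sketch below uses only the Python standard library and assumes simple four-line fastq records; the function names and the example record are illustrative and are not part of QIIME or Biopython. The thresholds are a heuristic, not a guarantee.

```python
def quality_ascii_range(fastq_lines):
    """Return (min, max) ASCII code seen on the quality lines.

    Assumes simple 4-line records: @header, sequence, +, quality.
    """
    lo, hi = 255, 0
    for i, line in enumerate(fastq_lines):
        if i % 4 == 3:  # every fourth line holds the quality string
            codes = [ord(c) for c in line.rstrip("\n")]
            if codes:
                lo = min(lo, min(codes))
                hi = max(hi, max(codes))
    return lo, hi


def guess_phred_offset(lo, hi):
    """Heuristic guess of the ASCII offset from the observed range."""
    if lo < 59:
        return 33    # codes below ';' should not occur in +64 files
    if hi > 74:
        return 64    # codes above 'J' are unusual for phred+33 data
    return None      # ambiguous range; inspect more reads


def decode_quality(qual_string, offset=33):
    """Convert a fastq quality string into a list of Phred scores."""
    return [ord(c) - offset for c in qual_string]


# Made-up record, only to show the calls; a real file handle can be
# passed to quality_ascii_range() in the same way.
example = ["@read1\n", "ACGT\n", "+\n", "5?II\n"]
lo, hi = quality_ascii_range(example)
print(lo, hi, guess_phred_offset(lo, hi))
print(decode_quality("5?II"))
```

If the guessed offset is 33, the scores written to a .qual file by a phred+33 converter should agree with what the sequencer reported.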
 
 
 
Thanks for your attention
 
regards
 
Sergio
 
 
 
 

Gregg Iceton

Jan 29, 2014, 6:02:15 AM
to qiime...@googlegroups.com
Hi. I use QIIME to analyse Ion Torrent data very regularly. The FASTQ file produced by Torrent Suite is perfectly compatible with QIIME; it is the SFF file that is not. Firstly, bear in mind that Ion Torrent Phred scores are consistently under-called. I don't have a reference for this to hand, but the rule of thumb is to knock 5 off what you would expect from Illumina, e.g. Illumina Phred 25 = Ion Torrent Phred 20. So a cut-off of 20 is not as low as you might think. By default, Torrent Suite trims the sequences until they have an average read quality of 15. If you have access to Torrent Suite, you could reanalyse your data to Phred 20 and then forget about quality scores in QIIME.
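
For reference, a Phred score Q corresponds by definition to an estimated per-base error probability of 10^(-Q/10). A minimal Python sketch for comparing the thresholds mentioned here (15, 20 and 25); the function name is illustrative:

```python
def phred_to_error_probability(q):
    """Estimated probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


for q in (15, 20, 25):
    print(q, phred_to_error_probability(q))
```

This only converts nominal scores; it says nothing about how well any platform's scores are calibrated.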

Your split libraries command looks fine, though you are using a sliding window for additional quality filtering which may be where you are losing some reads.  If you post your split libraries log then this would help elucidate the problem.
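
To illustrate how a sliding quality window can shorten or remove reads, here is a rough Python sketch. It is not QIIME's code: it assumes the read is cut at the first base of the first window whose mean score falls below the threshold, and is then dropped if it ends up shorter than the minimum length. The function names and default values are illustrative.

```python
def truncate_at_low_quality_window(scores, window=50, min_mean=20):
    """Return the number of leading bases to keep.

    The read is cut at the start of the first window whose mean
    score is below min_mean; reads shorter than the window are kept whole.
    """
    if len(scores) < window:
        return len(scores)
    window_sum = sum(scores[:window])
    for start in range(len(scores) - window + 1):
        if start > 0:
            # slide the window one base to the right
            window_sum += scores[start + window - 1] - scores[start - 1]
        if window_sum / window < min_mean:
            return start
    return len(scores)


def passes_after_truncation(scores, min_length=200, window=50, min_mean=20):
    """True if the truncated read is still at least min_length long."""
    keep = truncate_at_low_quality_window(scores, window, min_mean)
    return keep >= min_length


# Made-up reads: good quality followed by a poor-quality tail.
long_read = [30] * 250 + [10] * 100
short_read = [30] * 100 + [10] * 100
print(truncate_at_low_quality_window(long_read), passes_after_truncation(long_read))
print(truncate_at_low_quality_window(short_read), passes_after_truncation(short_read))
```

Under these assumptions a read with a poor tail survives only if enough good bases precede the first bad window, so a window check can remove reads that pass the mean-quality check.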

How was your FASTQ created?  You can do this within Torrent Suite which I know works fine.

Sergio Balzano

Jan 29, 2014, 5:47:05 PM
to qiime...@googlegroups.com
Hi Gregg
 
Thanks for your advice.
 
1) I had my amplicons sequenced by a company (Australian Genome Research Facility) and they sent me a fastq and a bam file (I read that Ion Torrent does not use sff files anymore). I don't have Torrent Suite and, as far as I was told yesterday, I cannot download it because I don't have a PGM. I sent an email to the company that did the sequencing for me asking how the fastq file was created, and I'm currently waiting for an answer.
 
2) I use the sliding window of 50 bp following previous work published by other authors (same genes, same kind of environment, but they used 454).
 
3) Below is the split_libraries log. Looking at the log in more detail, I think the main problem is related to primer mismatches. What is odd is that my run included both 16S and 18S amplicons and primer mismatches occurred in both the 18S and the 16S samples. Since I processed all 18S PCRs (and similarly all 16S PCRs) in exactly the same way, it is strange that mismatches occurred systematically in some samples.
 
Sergio
 
Length outside bounds of 200 and 470    2989469
Num ambiguous bases exceeds limit of 6  0
Missing Qual Score      0
Mean qual score below minimum of 20     3318
Max homopolymer run exceeds limit of 8  1691
Num mismatches in primer exceeds limit of 3: 706506
Size of quality score window, in base pairs: 50
Number of sequences where a low quality score window was detected: 122410
Sequences with low quality score window were truncated to the first base of the window.
Sequences discarded after truncation due to sequence length below the minimum 200: 50911
Sequence length details for all sequences passing quality filters:
Raw len min/max/avg     231.0/501.0/365.1
Wrote len min/max/avg   189.0/489.0/304.5
Barcodes corrected/not  1927/3247
Uncorrected barcodes will not be written to the output fasta file.
Corrected barcodes will be written with the appropriate barcode category.
Corrected but unassigned sequences will not be written unless --retain_unassigned_reads is enabled.
Total valid barcodes that are not in mapping file       0
Sequences associated with valid barcodes that are not in the mapping file will not be written.
Barcodes in mapping file
Num Samples     17
Sample ct min/max/mean: 44 / 52336 / 8958.47
Sample  Sequence Count  Barcode
Penneuk1Oct     52336   TACTCACGATA
Penneuk1Dec     32307   TGATGATTGCC
Penneuk5Dec     26077   TCGATAATCTT
Pennprok1Oct    24073   TGACATTACTT
Coo3Nov 16223   TGACCGCATCC
Penneuk1Mar     276     TCTTACACCAC
Coo5Nov 166     TGGTGTAGCAC
Penneuk5Oct     151     TCGTGTCGCAC
Penneuk5Mar     114     TAGCCAAGTAC
Coo6Nov 100     TGAAGTAGCAC
Pennprok5Mar    81      TCAAGCACCGC
Coo1Nov 78      TAGCTTACCGC
Penneprok5Oct   71      TGCCTTACCGC
Pennprok5Dec    69      TGCAAGCCTTC
Pennprok1Mar    68      TACATTACATC
Coo2Nov 60      TCATGATCAAC
Pennprok1Dec    44      TACCGAGGCAC
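
As a rough cross-check of the log above, the counts can be tallied in a few lines of Python. This is only a sketch: it assumes the discard categories are mutually exclusive and together account for every input read, which the log itself does not state. The dictionary labels are paraphrased from the log.

```python
# Discard counts copied from the split_libraries log.
discarded = {
    "length outside 200-470": 2989469,
    "too many ambiguous bases": 0,
    "missing qual score": 0,
    "mean quality below 20": 3318,
    "homopolymer run above 8": 1691,
    "primer mismatches above 3": 706506,
    "too short after window truncation": 50911,
    "uncorrected barcodes": 3247,
}

# Per-sample counts of written sequences, in the order listed in the log.
sample_counts = [52336, 32307, 26077, 24073, 16223, 276, 166, 151, 114,
                 100, 81, 78, 71, 69, 68, 60, 44]

written = sum(sample_counts)
# ASSUMPTION: categories do not overlap, so the total is a plain sum.
total = written + sum(discarded.values())

print("written:", written, "mean per sample:", round(written / len(sample_counts), 2))
for reason, count in sorted(discarded.items(), key=lambda kv: -kv[1]):
    print(f"{reason}: {count} ({100 * count / total:.1f}% of estimated input)")
print(f"written: {100 * written / total:.1f}% of estimated input")
```

The per-sample mean computed this way matches the 8958.47 reported in the log, which suggests the sample counts were copied correctly; the percentages are only as good as the non-overlap assumption.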

Sergio Balzano

Jan 29, 2014, 6:03:55 PM
to qiime...@googlegroups.com
Hi Gregg,
So a fastq file generated by Ion Torrent has exactly the same structure/format (phred+33) as a fastq file generated by Illumina, 454 or any other sequencing platform, and QIIME can therefore read and process it in exactly the same way?
Sergio
 


Gregg Iceton

Jan 29, 2014, 6:10:57 PM
to qiime...@googlegroups.com
Hi Sergio

The default file type produced by Torrent Suite is BAM. However, you can use a plugin to create FASTQ and SFF. As you say in your subsequent post, the FASTQ is the same as most other types (note that earlier Illumina platforms produced FASTQ with phred+64 and other variations). So if the FASTQ was made with Torrent Suite then it is definitely in the standard format. However, there are other programs that will convert BAM to FASTQ and I cannot comment on their suitability as I don't use them.

Primer mismatches are inevitable due to errors introduced in the PCR and errors in sequencing. In addition, because of non-specific annealing, your primers will sometimes anneal to a slightly different sequence; these primer mismatches are not in fact errors but a consequence of no primer being truly "universal". You might consider turning off primer matching, depending on what downstream analysis you are doing, since if the sequence is not 16S it will not be matched in the databases. That would be your call.

However, I would suggest the most significant issue is your loss of almost 3 million reads due to being outside the bounds of your specified minimum and maximum length!  I would have a look at the read length histogram from your sequencing company if they supplied it to see where these lost reads reside i.e. are they too big or too small.  You could also do this by altering your minimum accepted size but leaving the maximum the same, and vice versa.
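
If no histogram was supplied, the read-length distribution can be computed directly from the fasta file. The following is a minimal standard-library sketch; the function names, the bin width and the tiny in-memory example are illustrative, and the 200/470 bounds are simply the ones from the split libraries command.

```python
from collections import Counter


def read_lengths(fasta_lines):
    """Yield the length of each sequence in FASTA-formatted lines."""
    length = None
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif length is not None:
            length += len(line)  # sequences may span several lines
    if length is not None:
        yield length


def length_histogram(lengths, bin_width=50):
    """Count reads per length bin; keys are the lower edge of each bin."""
    return Counter((n // bin_width) * bin_width for n in lengths)


def count_by_bounds(lengths, min_len=200, max_len=470):
    """Return (too_short, in_range, too_long) for a list of lengths."""
    too_short = sum(1 for n in lengths if n < min_len)
    too_long = sum(1 for n in lengths if n > max_len)
    return too_short, len(lengths) - too_short - too_long, too_long


# Made-up records; a real file handle can be passed to read_lengths() instead.
example = [">r1", "ACGT" * 30, ">r2", "ACGT" * 60, "ACGT" * 5, ">r3", "A" * 480]
lengths = list(read_lengths(example))
print(sorted(length_histogram(lengths).items()))
print(count_by_bounds(lengths))
```

Splitting the out-of-bounds reads into "too short" and "too long" answers the same question as changing the minimum and maximum one at a time, without rerunning the filtering.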

Sergio Balzano

Jan 29, 2014, 10:12:30 PM
to qiime...@googlegroups.com
Hi again Gregg,
Thanks for the further explanations.
 
Yes, I know that the most significant issue with my data is that 70% of the reads are too short. I was told that this happened because, even though I used an Ion Torrent chip designed for 400 bp, the amplicons I sent (adaptor + barcode + forward primer + PCR product + reverse primer + adaptor) were too long (470 bp), which caused problems in the emulsion PCR and produced short amplicons. The primers I used amplify conserved regions of the SSU but were previously used for Sanger or 454, which can handle longer reads. However, I still have many sequences to analyse.
 
As for the primer mismatches, I think that something goes wrong in split_libraries.py when the primer sequences contain degenerate nucleotides, so that too many sequences are discarded. I turned off primer matching (-M length_of_my_forward_primer); non-specific amplicons will then be removed after taxonomy assignment.
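
One way to test this suspicion independently of QIIME is to count mismatches against the degenerate primer by hand. The sketch below uses the standard IUPAC nucleotide codes; it is not how split_libraries.py does it, and the function name and the short primer are made up for illustration.

```python
# Standard IUPAC nucleotide codes and the bases each one allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def primer_mismatches(primer, read_start):
    """Count positions where the read base is not allowed by the primer code.

    Only the overlapping positions are compared, so read_start should be
    at least as long as the primer.
    """
    return sum(
        1
        for p, b in zip(primer.upper(), read_start.upper())
        if b not in IUPAC.get(p, "")
    )


primer = "ACGYRT"  # hypothetical degenerate primer, not the one used here
for read_start in ("ACGCAT", "ACGAAT", "TCGGCT"):
    print(read_start, primer_mismatches(primer, read_start))
```

With an 11 bp barcode at the start of each read, the slice to compare would be read[11:11 + len(primer)], assuming the primer follows the barcode directly. If reads counted this way show few mismatches while the log reports many, the primer entry in the mapping file is worth rechecking.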
 
Sergio