Dear all
I received my Ion Torrent data as a .bam and a .fastq file. According to the run report, bad-quality sequences (low quality and polyclonal) had already been removed, so I (probably wrongly) assumed that my data were already clean and ready to use.
1) I removed all reads that were too short, converted my fastq file to fasta, and split the fasta file into separate samples according to the barcodes using Biopython.
2) After that I started using QIIME (I didn't know I could do some of the previous steps with QIIME) to remove chimeras, cluster OTUs and assign taxonomy.
3) I noticed that the taxonomy results are quite poor, with too many reads (up to 5%) having no BLAST hit, where 'no blast hit' means that the closest match covers < 90% of my sequence in the BLAST result. Given that I'm working with 16S/18S data, it seems odd to have so many unassigned reads. Even when I manually BLAST some randomly picked unassigned reads against GenBank, they match known sequences very poorly (< 90%). So I thought that my data needed some further cleaning.
4) I went back to the fasta and qual files (obtained from my initial fastq file with the SeqIO.convert command from Biopython) and analysed my .qual file using quality_scores.py. From the output PDF it looks as though, although my data are not very clean, the quality score for most data is > 20 or > 25. However, there is a high standard deviation and some data definitely need to be removed.
5) Using the same fasta and qual files, I ran split_libraries.py to clean and split them. However, even with a low Phred threshold (20), < 5% of my data pass the filtering step. The command line I used is below:
$split_libraries.py -m my_mapping_file.txt -f all_data.fasta -q all_data.qual -l 200 -L 470 -t -s 20 -k -H 8 -M 3 -b 11 -o my_output_directory -w 50 -d
The only doubt I have is about the -b option: my barcodes are 11 bp long, so I think I set it correctly.
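To get a feel for why so few reads survive, it can help to reproduce the filter logic outside QIIME. The following is only a minimal sketch, assuming that -w 50 with -s 20 means "flag a read when the mean quality of any 50-base window drops below 20" and that -l/-L are inclusive length bounds. The function names are hypothetical and this is not QIIME's actual code.

```python
def first_bad_window(quals, window=50, min_mean=20):
    """Return the start index of the first window whose mean quality is
    below min_mean, or None if no such window exists.

    quals is a list of integer Phred scores for one read.
    """
    if len(quals) < window:
        return None  # read shorter than one window: nothing to test
    window_sum = sum(quals[:window])
    if window_sum / window < min_mean:
        return 0
    for start in range(1, len(quals) - window + 1):
        # Slide by one base: add the new base, drop the one that left.
        window_sum += quals[start + window - 1] - quals[start - 1]
        if window_sum / window < min_mean:
            return start
    return None


def passes_filter(quals, min_len=200, max_len=470, window=50, min_mean=20):
    """Assumed behaviour: reject reads outside the length bounds or
    containing at least one low-quality window."""
    if not (min_len <= len(quals) <= max_len):
        return False
    return first_bad_window(quals, window, min_mean) is None


# A read with a good start and a poor-quality tail.
good_read = [30] * 250
tailing_read = [30] * 200 + [10] * 50
print(passes_filter(good_read))        # True
print(first_bad_window(tailing_read))  # 176
```

Running this over the scores in the .qual file and counting which check rejects each read (length or window) would show whether the loss comes from a quality drop-off towards the 3' end or from the length limits.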
6) Then, reading the QIIME forum and other forums, I noticed that fastq files are not uniquely defined, and that those produced by Ion Torrent differ from those produced by 454. Therefore even the conversion from fastq to qual (which I did in Biopython) might be wrong. The conversion from fastq to fasta should not be wrong: it's just a matter of replacing "@" with ">" in the header line and removing the "+" separator and quality score lines.
7) I then read that QIIME does not support Ion Torrent files, and therefore the results from both quality_scores.py and split_libraries.py might be wrong.
8) I would like to hear from other QIIME users who have worked with Ion Torrent data processed with Torrent Suite 3.6 or newer (where they started using bam files instead of sff).
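On point 6, the part of the conversion that can go wrong is the ASCII offset used to turn quality characters into numbers. Below is a minimal sketch of the conversion done by hand, assuming four-line fastq records and a Phred+33 (Sanger-style) offset, which is what Biopython's "fastq" format name assumes. Whether a particular Ion Torrent export really uses that offset is an assumption to check against the raw file. All function names are hypothetical.

```python
def parse_fastq(lines):
    """Yield (name, sequence, quality_string) from four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        seq = next(it).rstrip("\n")
        plus = next(it).rstrip("\n")
        qual = next(it).rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("not a four-line FASTQ record: " + header)
        if len(seq) != len(qual):
            raise ValueError("sequence/quality length mismatch: " + header)
        yield header[1:], seq, qual


def decode_quality(qual, offset=33):
    """Convert a quality string to integer Phred scores for a given offset."""
    return [ord(ch) - offset for ch in qual]


def lowest_ascii(lines):
    """Smallest ASCII code seen in any quality string.
    A value below 64 rules out a Phred+64 encoding."""
    return min(ord(ch) for _, _, qual in parse_fastq(lines) for ch in qual)


def to_fasta_and_qual(lines, offset=33):
    """Return (fasta_text, qual_text) for the given FASTQ lines."""
    fasta, qual_out = [], []
    for name, seq, qual in parse_fastq(lines):
        scores = " ".join(str(q) for q in decode_quality(qual, offset))
        fasta.append(">" + name + "\n" + seq)
        qual_out.append(">" + name + "\n" + scores)
    return "\n".join(fasta) + "\n", "\n".join(qual_out) + "\n"


example = ["@read1\n", "ACGT\n", "+\n", "I5+!\n"]
fasta_text, qual_text = to_fasta_and_qual(example)
print(fasta_text)           # >read1 / ACGT
print(qual_text)            # >read1 / 40 20 10 0
print(lowest_ascii(example))  # 33
```

Comparing a few reads decoded this way with the corresponding entries in the .qual file written by SeqIO.convert would show whether the two agree on the offset.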
Thanks for your attention
regards
Sergio