Aloha,
I am running a 16S data set through QIIME 1.9.1 (latest version of MacQIIME) and trying to understand exactly what is being filtered out. We had a not-so-great Illumina run (53% of the sequences were discarded on the machine because of overclustering and were never written to the raw fastq files), but this is a preliminary data set, so I have been analyzing it anyway. For a given sample I start with ~70,000 sequences, and after running split_libraries_fastq.py, for all 20 samples the number of sequences written is roughly 5,000. This is a considerable reduction, which may be appropriate given the quality of the data, but I wanted to double-check that I have the parameters correct.
Here is the command I used for the quality filtering (these were already demultiplexed files):
split_libraries_fastq.py -i RDP_2.fastq,RDP_6.fastq,RDP_10.fastq,RDP_13.fastq,RDP_8.fastq,RDP_12.fastq,RDP_7.fastq,RDP_19.fastq,RDP_18.fastq,RDP_9.fastq,RDP_3.fastq,RDP_20.fastq,RDP_15.fastq,RDP_5.fastq,RDP_17.fastq,RDP_16.fastq --sample_ids RDP_2,RDP-6,RDP_10,RDP_13,RDP_8,RDP_12,RDP_7,RDP_19,RDP_18,RDP_9,RDP_3,RDP_20,RDP_15,RDP_5,RDP_17,RDP_16 -o split_libraries_lava_caves/ -q 19 --barcode_type 'not-barcoded'
That means I left the defaults for -r, --max_bad_run_length = 3; -p, --min_per_read_length_fraction = 0.75; and -n, --sequence_max_n = 0. I am not clear on what -r and -p really mean.
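To make the question concrete, here is how I currently read the option descriptions, written as a small Python sketch. This is my own illustration and not QIIME's code: the function name filter_read is made up, and the exact boundary handling (for example "<=" versus "<") is an assumption on my part. My reading is that a read is truncated at the start of the first run of more than -r consecutive bases at or below -q, and then discarded if what is left is shorter than -p times the original length or contains more than -n N characters.

```python
def filter_read(seq, quals, q=19, r=3, p=0.75, n=0):
    """Hypothetical sketch of my reading of -q, -r, -p and -n.

    seq   : read sequence (str)
    quals : per-base Phred scores (list of int, same length as seq)
    Returns the (possibly truncated) read, or None if it would be discarded.
    """
    min_len = p * len(seq)          # -p is a fraction of the input read length
    cut = len(seq)                  # default: keep the whole read
    run_start = None                # start index of the current low-quality run

    for i, score in enumerate(quals):
        if score <= q:              # assumption: -q is the highest unacceptable score
            if run_start is None:
                run_start = i
            if i - run_start + 1 > r:   # run is longer than -r: truncate here
                cut = run_start
                break
        else:
            run_start = None

    kept = seq[:cut]
    if len(kept) < min_len:         # too little of the read survived truncation
        return None
    if kept.count("N") > n:         # too many ambiguous bases
        return None
    return kept


# Toy 20-base read: a run of 4 low-quality bases at the end is trimmed off.
read = "ACGT" * 5
trimmed = filter_read(read, [30] * 16 + [10] * 4)

# Same read, but the low-quality run starts at base 11, so less than 75% is left.
discarded = filter_read(read, [30] * 10 + [10] * 10)
```

If this reading is right, a 441 bp joined read needs at least 331 bases (0.75 x 441 = 330.75) before the first run of more than three bases at Q19 or below, otherwise the whole read is dropped. Is that what is actually happening?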
The only parameter I changed was -q (Q-score), which I set to 19. Previously, when working with 454 data, you could set the minimum and maximum sequence length as well as the quality score.
The median length of the sequences was 441 bp, but it varies slightly from sample to sample (441-446, for example). These are from a paired-end 300-cycle run, and the paired ends have been joined.
I could use some guidance, as I want to make sure that sequences with a Q-score lower than 20 are being removed, as well as reads that are too long or too short.
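For the length part, what I have in mind is something like the sketch below, run on the joined fastq files before or after quality filtering. It is only an illustration of the filter I would like, not an existing QIIME option: the function names are my own, and the min_len and max_len values in the example are placeholders, not cutoffs I have settled on.

```python
def read_fastq(lines):
    """Yield (header, sequence, quality) from plain 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        if not header.strip():      # tolerate blank lines between records
            continue
        seq = next(it).rstrip("\n")
        next(it)                    # '+' separator line, ignored
        qual = next(it).rstrip("\n")
        yield header.rstrip("\n"), seq, qual


def length_filter(records, min_len, max_len):
    """Keep only records whose sequence length is within [min_len, max_len]."""
    for header, seq, qual in records:
        if min_len <= len(seq) <= max_len:
            yield header, seq, qual


# Toy example with two in-memory records; only the 4-base read is kept.
example = [
    "@r1\n", "ACGT\n", "+\n", "IIII\n",
    "@r2\n", "ACGTACGT\n", "+\n", "IIIIIIII\n",
]
kept = list(length_filter(read_fastq(example), min_len=3, max_len=5))
```

On real data I would pass an open file handle (for example open("RDP_2.fastq")) instead of the in-memory list.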
I have also run this data set in the new EDGE platform from Los Alamos National Lab (there is a QIIME application within it). In EDGE you can set -p, -q, and the maximum number of "N"s, but the EDGE output for split_libraries.py shows 20-30K sequences written after quality filtering, even when I set the parameters I can control to the same values I used in QIIME.
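To compare the two outputs sample by sample, I have been counting sequences per sample from the seqs.fna headers with a short Python sketch like the one below. It assumes the usual QIIME 1 header layout of ">SampleID_N" followed by a space and other fields, and I am assuming the EDGE output follows the same layout; the function name is my own.

```python
from collections import Counter


def count_per_sample(fasta_lines):
    """Count sequences per sample from headers shaped like '>SampleID_N ...'."""
    counts = Counter()
    for line in fasta_lines:
        if line.startswith(">"):
            seq_id = line[1:].split()[0]          # e.g. 'RDP_2_0'
            sample_id = seq_id.rsplit("_", 1)[0]  # drop the trailing sequence index
            counts[sample_id] += 1
    return counts


# Toy example with made-up headers in the assumed layout.
example_fasta = [
    ">RDP_2_0 read_a\n", "ACGT\n",
    ">RDP_2_1 read_b\n", "ACGT\n",
    ">RDP_10_2 read_c\n", "AC\n",
]
per_sample = count_per_sample(example_fasta)
```

Splitting on the last underscore matters here because my sample IDs themselves contain underscores.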
Any help or suggestions appreciated.
Cheers
Becks