I am trying to cluster my Illumina MiSeq dataset using Swarm within QIIME. My files are already demultiplexed, so I have run multiple_join_paired_ends.py and multiple_split_libraries_fastq.py. I then ran count_seqs.py on seqs.fna (the output of multiple_split_libraries_fastq.py) to check the number of amplicons in my input file (1,461,190), and I am trying to run pick_de_novo_otus.py. The command I am running is:
pick_de_novo_otus.py -i seqs.fna -o swarm_otus/ -p swarm_params.txt
The parameters file is as follows:
The command runs the clustering successfully and gives me a set of representative sequences, the swarm picked OTUs, and the otu_table.biom. (Unrelated - I am using the COI gene so the taxonomic assignment is failing, but I expected this based on the default reference database. I do plan to use a non-default database, but am planning to address that after I have the OTU clustering working successfully).
This clustering gave me 432,398 OTUs, but when I look at the OTU table, the total number of amplicons is much lower than the number in my input file (994,330 in the OTU table vs 1,461,190 in the input). The most abundant swarm/OTU had 4,671 amplicons. I've also run the command with different values for d (~6-9; I realize now that these are much too high and am sticking with 1 or 2!) and got similar results: the number of amplicons in the OTU table is much lower than the number in the input file, but it varies (~750,000 - 785,000).
To test these results, I ran Swarm in standalone mode. I dereplicated the seqs.fna file output by multiple_split_libraries_fastq.py using the bash script available at https://github.com/torognes/swarm#dereplication. I then ran Swarm on this dereplicated file with the following command:
swarm -w amplicons_representatives.fast -o amplicons.swarms -s amplicons.stats amplicons_linearized_dereplicated.fasta
This resulted in 432,398 OTUs, the same number as clustering with Swarm via QIIME. However, the most abundant swarm contained 22,903 sequences (vs 1,671 by QIIME, above).
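For context, here is a rough Python sketch of what the dereplication step does, as I understand it: identical sequences are collapsed into one record, and the number of copies is stored in the record name. This is not the bash script from the Swarm README. The function names are my own, and the `<sha1>_<abundance>` naming is my assumption about the format that script produces. Unlike the original script, this sketch does not filter out sequences containing characters other than A, C, G and T.

```python
import hashlib
from collections import Counter


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def dereplicate(lines):
    """Collapse identical sequences and return (name, sequence) records.

    Names are "<sha1 of sequence>_<abundance>" (assumed format), sorted by
    decreasing abundance, with ties broken by sequence.
    """
    counts = Counter(seq.upper() for _, seq in read_fasta(lines))
    records = []
    for seq, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        digest = hashlib.sha1(seq.encode("ascii")).hexdigest()
        records.append(("%s_%d" % (digest, n), seq))
    return records
```

The point is that after this step each record stands for several reads, so any tool that counts one record as one amplicon will report lower abundances than a tool that reads the suffix.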
While I haven't been able to compare an OTU table output by standalone Swarm to the OTU table output by QIIME, I can tell that there is a difference between the two methods. I think the clustering is identical between the two analyses, but something is wrong with the abundance values for my OTUs. I would like to use QIIME for downstream steps as well, and don't particularly want to reformat all of my data to create an OTU table using standalone Swarm (as per Frederic Mahe's pipeline), but I don't trust the abundance numbers that QIIME is outputting.
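One check I can think of on the standalone side is to sum the abundances in the Swarm output and compare the total with the input count. Below is a minimal sketch, assuming that each line of amplicons.swarms is one swarm, that amplicon names are separated by whitespace, and that each name ends in `_<abundance>`. The function names are hypothetical, and the format assumptions should be checked against the actual file.

```python
def swarm_sizes(lines):
    """Return the summed abundance of each swarm, one per non-empty line."""
    sizes = []
    for line in lines:
        ids = line.split()
        if not ids:
            continue
        # Assumed name format: "<identifier>_<abundance>"
        sizes.append(sum(int(name.rsplit("_", 1)[1]) for name in ids))
    return sizes


def summarize(lines):
    """Summarize the number of swarms, total amplicons and largest swarm."""
    sizes = swarm_sizes(lines)
    return {
        "n_swarms": len(sizes),
        "total_amplicons": sum(sizes),
        "largest_swarm": max(sizes) if sizes else 0,
    }


# Example use on the real output file:
# with open("amplicons.swarms") as handle:
#     print(summarize(handle))
```

If dereplication kept every read, I would expect total_amplicons to match the 1,461,190 reported by count_seqs.py, which would at least confirm that the standalone abundances account for the whole input.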
Any help would be much appreciated - I'm not sure what the next step of my troubleshooting should be!