I recently ran the command below on a quality-filtered FASTA file of 808,184 Illumina reads from ~10 low-diversity environmental samples.
pick_otus.py -i ../seqs.fasta -r ./97_otus.fasta -s 0.97 -m usearch61 --threads 8 -o otu_output_pick_manual_usearch
The command completes successfully, and the log file shows:
98,000 OTUs? How is that possible? After the reference set was built, I aligned a subset of the representative sequences against each other using MUSCLE and found that many were more than 97% identical (often 98% or 99.xx%) to other OTU representatives. What is going on?
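A pairwise check of this kind can be sketched in Python roughly as follows. This is only a minimal illustration under stated assumptions: the function names `percent_identity` and `pairs_above` are hypothetical, the input is assumed to be sequences that have already been aligned to equal length (for example, parsed from a MUSCLE alignment), and identity is computed here by counting a gap against a base as a mismatch, which may differ from how usearch61 or UCLUST define identity.

```python
from itertools import combinations


def percent_identity(a, b):
    """Percent identity between two aligned sequences of equal length.

    Assumption: columns where both sequences have a gap are skipped,
    and a gap opposite a base counts as a mismatch.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" and y == "-":
            continue  # shared gap column carries no information
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def pairs_above(aligned, threshold=97.0):
    """Return (id1, id2, identity) for every pair above the threshold.

    `aligned` is a dict mapping sequence ID to aligned sequence.
    """
    hits = []
    for (id1, s1), (id2, s2) in combinations(sorted(aligned.items()), 2):
        pid = percent_identity(s1, s2)
        if pid > threshold:
            hits.append((id1, id2, pid))
    return hits
```

Any pair reported by `pairs_above` would be a pair of representatives that, under this particular identity definition, look closer than the 97% clustering threshold.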
I encounter a similar issue when running pick_open_reference_otus.py:
pick_open_reference_otus.py -i ./seqs.fasta -r ./97_otus.fasta -o ./otu_output_pipeline/ -m usearch61 -s 0.1
In this case, we get the following OTU numbers per step:
Does anyone understand why I'm getting so many OTUs at the de novo step, or why the OTUs created are not distinct at the 97% threshold? I've been able to replicate this on a couple of different datasets. Using UCLUST also gives an OTU count that is far too high (~2,000).
Here is my QIIME config: