Hello all,
I recently ran the command below on a quality-filtered FASTA file of 808,184 Illumina reads from ~10 low-diversity environmental samples.
pick_otus.py -i ../seqs.fasta -r ./97_otus.fasta -s 0.97 -m usearch61 --threads 8 -o otu_output_pick_manual_usearch
The command completes successfully, and the log file shows:
Usearch610DeNovoOtuPicker parameters:
Application:usearch61
minlen:64
output_dir:otu_output_pick_manual_usearch
percent_id:0.97
remove_usearch_logs:False
rev:False
save_intermediate_files:True
sizeorder:False
threads:8
usearch61_maxaccepts:1
usearch61_maxrejects:32
usearch61_sort_method:abundance
usearch_fast_cluster:False
verbose:False
wordlength:8
Num OTUs:98403
98,000 OTUs? How is that possible? After building the set of representative sequences, I aligned a portion of them against each other using MUSCLE and found that many were more than 97% identical to other OTU representatives (often 98 or 99.xx% identity). What is going on?
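To illustrate the kind of check I mean, here is a rough sketch (not the exact script; rep_subset.afa stands in for a MUSCLE alignment of a subset of the representative sequences, and it assumes Biopython is installed):

# Rough sketch: pairwise identity check on a MUSCLE alignment of a subset of
# the OTU representative sequences. rep_subset.afa is a placeholder name, and
# percent identity here is approximated from the multiple alignment.
from itertools import combinations
from Bio import AlignIO

aln = AlignIO.read("rep_subset.afa", "fasta")

def percent_identity(rec_a, rec_b):
    # Matches over columns where at least one of the two sequences has a residue.
    matches = compared = 0
    for x, y in zip(str(rec_a.seq).upper(), str(rec_b.seq).upper()):
        if x == "-" and y == "-":
            continue
        compared += 1
        if x == y and x != "-":
            matches += 1
    return 100.0 * matches / compared if compared else 0.0

# Report representative pairs that I would have expected to be merged at 97%.
for rec1, rec2 in combinations(aln, 2):
    pid = percent_identity(rec1, rec2)
    if pid >= 97.0:
        print(rec1.id, rec2.id, round(pid, 2))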
I encounter a similar issue when running pick_open_reference_otus.py:
pick_open_reference_otus.py -i ./seqs.fasta -r ./97_otus.fasta -o ./otu_output_pipeline/ -m usearch61 -s 0.1
In this case, we get the following OTU numbers per step:
Step 1: 280
Step 2: 104
Step 3: 104
Step 4: 88364
Does anyone understand why I'd be getting so many OTUs at the de novo step, or why the OTUs created are not distinct at the 97% threshold? I've been able to replicate this on a couple of different datasets, and using UCLUST also gives an OTU count that is far too high (~2,000).
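In case it helps with diagnosis, here is a rough sketch of how I have been looking at the OTU size distribution from the pick_otus.py OTU map (the path assumes the default <input_basename>_otus.txt naming and the usual tab-separated map format; adjust if yours differ):

from collections import Counter

# Sketch: summarize OTU sizes from the tab-separated OTU map
# (first field = OTU id, remaining fields = member read ids;
#  the file name assumes the default pick_otus.py output naming).
sizes = []
with open("otu_output_pick_manual_usearch/seqs_otus.txt") as fh:
    for line in fh:
        fields = line.rstrip("\n").split("\t")
        sizes.append(len(fields) - 1)  # reads assigned to this OTU

size_counts = Counter(sizes)
print("total OTUs:", len(sizes))
print("singleton OTUs:", size_counts[1])
print("largest OTU sizes:", sorted(sizes, reverse=True)[:10])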
Here is my QIIME config:
Thanks!
Alex