Hi everyone,
I am facing an issue with the OTU clustering using vsearch.
As background: I filter my Ion Torrent 16S sequences using more or less the pipeline described on the vsearch webpage: cutadapt to remove primers, length trimming at 220 bp, quality filtering at maxee = 2, sample pooling, dereplication, chimera removal (de novo and reference-based), OTU clustering at 97%, chimera removal again, taxonomic assignment using CREST, and decontam.
I started to question the clustering because my dataset seemed to contain a lot of OTUs. So I extracted the FASTA sequences (centroids) of all OTUs assigned to the genus Sulfurovum (146 OTUs). I aligned them and found that a number of the sequences actually differ by less than 3%. So why do they end up in different OTUs? I reran the vsearch clustering on these 146 sequences and the result is consistent: it still finds 146 OTUs. However, running the same 146 sequences through the usearch clustering command gives only 80 OTUs!
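For anyone who wants to repeat the dissimilarity check on the aligned centroids, here is a minimal sketch. It assumes each alignment row is available as an equal-length string with '-' for gaps. It counts identity as matching columns divided by alignment length excluding terminal gaps, which is how the vsearch manual describes the default definition behind --id (--iddef 2). The function name is hypothetical, and values taken from a multiple alignment need not equal those from the pairwise alignments vsearch computes itself.

```python
def pairwise_identity(row_a: str, row_b: str) -> float:
    """Identity between two rows taken from the same alignment.

    Columns where both rows have a gap are dropped first. Leading and
    trailing columns where either row has a gap (terminal gaps) are then
    ignored. Identity = matching columns / remaining columns.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")

    gap = "-"
    # Keep only columns in which at least one row has a residue.
    cols = [
        (a.upper(), b.upper())
        for a, b in zip(row_a, row_b)
        if not (a == gap and b == gap)
    ]

    # Skip terminal gap columns at both ends.
    start = 0
    while start < len(cols) and gap in cols[start]:
        start += 1
    end = len(cols)
    while end > start and gap in cols[end - 1]:
        end -= 1

    core = cols[start:end]
    if not core:
        return 0.0

    # Internal gap columns stay in the denominator and count as mismatches.
    matches = sum(1 for a, b in core if a == b)
    return matches / len(core)
```

Two centroids whose value from this function is at or above 0.97 would be candidates for the "should have been merged" cases described above, subject to the caveat about multiple versus pairwise alignment.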
That seemed odd, so I dug a bit further by running the same FASTA file through both vsearch and usearch.
1) I prepared a dereplicated file using usearch:
usearch11 -fastx_uniques 5-all.seqs.fasta -sizeout -relabel Uniq -fastaout 6-uniques.fa
2) Then ran these ~360 000 unique seqs through usearch and vsearch in parallel:
# OTU clustering at 97% using usearch
usearch11 -cluster_otus 6-uniques.fa -otus 7-usearch-otus.fa -relabel Otu
# OTU clustering at 97% using vsearch
$VSEARCH --threads $THREADS \
--cluster_size 6-uniques.fa \
--id 0.97 \
--strand plus \
--sizein \
--sizeout \
--fasta_width 0 \
--relabel OTU_ \
--centroids 7-vsearch-otus.fasta
# Because usearch removes singletons and chimeras, I ran the vsearch output through a python script to remove singletons, and then through chimera detection:
$VSEARCH --threads $THREADS \
--uchime_denovo 7-vsearch-otus.nosin.fasta \
--sizein \
--sizeout \
--fasta_width 0 \
--nonchimeras vsearch-otus.nosin.nochim.fasta
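The singleton-removal script itself is not shown here. A minimal sketch of what such a filter might look like follows, assuming the centroid headers carry the ";size=N" annotation written by --sizeout. The function names and the choice to treat a missing annotation as size 1 are assumptions for illustration, not the script that was actually used.

```python
import re
from typing import Iterable, Iterator, Tuple

# Abundance annotation as written by --sizeout, e.g. ">OTU_1;size=42"
SIZE_RE = re.compile(r";size=(\d+)")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_singletons(lines: Iterable[str], min_size: int = 2) -> Iterator[Tuple[str, str]]:
    """Keep only records whose size annotation is at least min_size."""
    for header, seq in read_fasta(lines):
        match = SIZE_RE.search(header)
        # Assumption: a record without a size annotation counts as size 1.
        size = int(match.group(1)) if match else 1
        if size >= min_size:
            yield header, seq


def filter_file(in_path: str, out_path: str, min_size: int = 2) -> None:
    """Write the records that pass the size filter to out_path."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for header, seq in drop_singletons(fin, min_size):
            fout.write(f">{header}\n{seq}\n")
```

With the file names used above, the call would be filter_file("7-vsearch-otus.fasta", "7-vsearch-otus.nosin.fasta").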
Result of the test: 18748 OTUs for vsearch, 11527 for usearch.
Why this difference? I know that the exact usearch algorithm is not public, but the results should still be roughly the same, no? The difference is huge.
Am I missing something or doing something wrong? I really do not understand why some of my Sulfurovum OTUs with low dissimilarity do not end up in the same OTU.
I can send some fasta files if somebody wants to reproduce the issue.
Thank you for your help.
Sven