Difference between vsearch and usearch clustering outputs

298 views
Skip to first unread message

Sven LE MOINE BAUER

unread,
Feb 11, 2021, 8:21:27 AM2/11/21
to VSEARCH Forum
Hi everyone,
I am facing an issue with the OTU clustering using vsearch.
As a background: I filter my ION torrent 16S sequences using more or less the pipeline described on the vsearch webpage: cutadapt to remove primers, length trimming at 220b, quality filtering at maxee = 2, pool samples, derep, chimera removal (denovo and ref-based), OTU clustering at 97%, chimera again, tax assignements using CREST, and decontam.

It started to question the clustering because I thought I had a lot of OTUs in my dataset. So I extracted the fasta sequences (centroids) of all OTUs assigned to the sulfurovum genus (146). I did an alignment with them and found that a bunch of the sequences had actually less than 3% dissimilarity. So why do they end up in different OTUs? I tried to rerun the vsearch clustering on the 146 seqs, and it is consistent, it still finds 146 OTUs. However, running these 146 seqs using the usearch clustering command, I find only 80 OTUs!

That was weird, and I wanted to dig a bit more into it by trying to run the same fasta file through both vsearch and usearch.
1) I prepared a dereplicated file using usearch:
usearch11 -fastx_uniques 5-all.seqs.fasta -sizeout -relabel Uniq -fastaout 6-uniques.fa

2) Then ran these ~360 000 unique seqs through usearch and vsearch in parallel:
# OTU clustering at 97% using usearch
usearch11 -cluster_otus 6-uniques.fa -otus 7-usearch-otus.fa -relabel Otu

#OTU clustering at 97% using vsearch
$VSEARCH --threads $THREADS \
    --cluster_size 6-uniques.fa \
    --id 0.97 \
    --strand plus \
    --sizein \
    --sizeout \
    --fasta_width 0 \
    --relabel OTU_ \
    --centroids 7-vsearch-otus.fasta 

# Because usearch removes singletons and chimeras, I ran the vsearch output through a python script to remove singletons, and then through chimera detection:
$VSEARCH --threads $THREADS \
    --uchime_denovo 7-vsearch-otus.nosin.fasta \
    --sizein \
    --sizeout \
    --fasta_width 0 \
    --nonchimeras vsearch-otus.nosin.nochim.fasta

Result of the test: 18748 OTUs for vsearch, 11527 for usearch.
Why this difference? I know that we do not know the exact usearch algorithm, but nevertheless it should be roughly same, no? The difference is huge.
Am I missing something or doing something wrong? I really do not understand why some of my sulfurovum OTUs with low dissimilarity do not end up together.
I can send some fasta files if somebody wants to reproduce the issue.
Thank you for your help.
Sven

Sven LE MOINE BAUER

unread,
Feb 11, 2021, 2:01:18 PM2/11/21
to VSEARCH Forum
Ho well, I guess I solved it myself. It is because usearch removes singleton sequences prior to clustering, right? I redid the test described in my previous poist, but with removing singletons prior to clustering with vsearch and I reached roughly 12000 OTUs, which is more similar to what usearch finds.
However, that does not tell me why in my 146 sulfurovum OTUs some have less than 3% dissimilarities.
Also, I was used to remove singleton OTUs, not sequences. Removing singletons during the sequence dereplication removes ca 80% of the clusters created. That is insane! Is it really worth it?

Torbjørn Rognes

unread,
Feb 12, 2021, 10:42:29 AM2/12/21
to VSEARCH Forum
Hi

The cluster_otu command in usearch performs combined clustering and chimera removal and also eliminates low-abundance sequences or clusters. I do not know exactly what it does in detail, but I think it is known to be rather strict. It is not a surprise that it ends up with much fewer OTUs.

I think the reason why you sometimes will find a few sequences that are less than 3% different among the centroids is that vsearch (and usearch) are heuristic algorithms. When aligning and clustering with a heuristic algorithm it is not guaranteed that it will find the absolutely most similar sequence. For example, if there are many sequences that are 96% identical and just one that is at least 97% similar, the heuristics may lead to the the most similar sequence to be overlooked. In that case, sequences that are less than 3% different may end up as centroids. If you increase the values specified with the options "--maxaccepts" and "--maxrejects" you can make it more accurate, but also slower. 

- Torbjørn

Sven LE MOINE BAUER

unread,
Feb 13, 2021, 10:27:38 AM2/13/21
to VSEARCH Forum
Hei Torbjørn,
thanks for your answer. I understand now.
There is a couple of things that I found weird though:
* Regarding the removal of singletons prior to OTU clustering. usearch does it, but then reuses the singletons when mapping all the sequences to the OTUs that were generated. However with vsearch, the singletons that were removed during dereplication are never used back for the mapping. They are lost in a way. I tested that with a set of 19 samples, running only denovo chimera prior to OTU clustering, not ref-based chimera detection:
usearch pipeline: % of output reads per sample in the otu table compared to post quality filtration: min = 88%, max = 98%, avg = 94%
vsearch pipeline without singleton removal at derep stage: min = 92; max = 99; avg =95%
vsearch pipeline with singleton removal at derep stage: min 73%, max = 92%, avg = 83%.
So we obviously lose 10/12% of the sequences by removing singletons during dereplication in the vsearch pipeline. I am not sure what is the biological significance of these singletons though... Anyway, is that the way it is supposed to be? Or am I doing something wrong?

* In the vsearch pipeline, there are 2 dereps: within each sample, and after concatenating all samples. I realized that I cant get the perl script (map.pl) to work if I would not do the derep within each sample. As I do not understand anything at perl I cannot check by myself if that is the way it is, or if I am doing wrong. I guess the question could be this: Does the perl script requires that sequences are unique within each sample?

Sven.

Torbjørn Rognes

unread,
Feb 15, 2021, 12:38:33 PM2/15/21
to VSEARCH Forum
Hi

You are right in that the VSEARCH example pipeline throws away those sequences that are only present once in the entire dataset. Those that are present at least two times, possibly in different samples, are included. This is intended.

If you prefer to use those sequences as well, you could try the alternative pipeline (at https://github.com/torognes/vsearch/wiki/Alternative-VSEARCH-pipeline). It will make clusters based on all sequences that are not considered chimeras. It will exclude clusters with just a single sequence. The rest will form the OTUs. It will then map all original sequences to the centroids of the clusters. It is probably more similar to the usearch pipeline you used. It is a bit slower that the other pipeline due to the search performed during the mapping.

I am not really sure what is best from a biological viewpoint. The pipelines presented on the VSEARCH Wiki pages are intended only as examples, and their accuracy has not been determined in detail.

The perl script can be used to find all the original fasta entries that where clustered or dereplicated. It requires (1) the original fasta file, (2) the "uc" file produced during clustering or dereplication, as well as (3) the fasta file with the centroids or dereplicated sequences. It is used in the first pipeline to simply pick the  sequences that where clustered or dereplicated into the final OTUs, excluding those that where considered chimeras or those that were only found once in whole dataset. It should be faster than performing a search against the OTU centroids for each sequence.

- Torbjørn

Torbjørn Rognes

unread,
Feb 15, 2021, 12:54:37 PM2/15/21
to VSEARCH Forum
You could also have a look at Fred's metabarcoding pipeline:

Sven LE MOINE BAUER

unread,
Feb 17, 2021, 8:07:32 AM2/17/21
to VSEARCH Forum
Nice. I did not see that alternative pipeline. I ran it and find nearly the same amount of OTUs than with the usearch pipeline.

I am trying swarm now and will try to compare the outputs. I am a bit confused though with the use of swarm in the alternative pipeline. 
The way it is done (if I understand well): the swarm clusters are made on the unique seqs (without singletons if desired) and later all sequences are mapped to these swarm clusters. But the mapping is done at a 97% identity. 
Isn't that going against the swarm approach? I am not sure if I am clear... sorry, not sure how to explain properly. Somehow the mapping at 97% feels like this is not what swarms wants to achieve. Is that what happens in Fred's python script to build the otu table?

Sven LE MOINE BAUER

unread,
Feb 17, 2021, 8:45:44 AM2/17/21
to VSEARCH Forum
Definitively something off there: I end up with less OTUs in the OTU table after the mapping than in my OTU fasta file. So some of the sequences that became centroids to some swarm clusters (OTUs) do not get mapped to that OTU. Does not seem right, right?

Frédéric Mahé

unread,
Feb 18, 2021, 2:01:13 PM2/18/21
to VSEARCH Forum


Hi Sven, you seem to have spotted a difference between vsearch and usearch. Would you mind sharing some of your data with me? Maybe the fasta file "6-unique.fa"? I would like to try to replicate your process.

Sven LE MOINE BAUER

unread,
Feb 19, 2021, 2:16:02 AM2/19/21
to VSEARCH Forum
Sure. I ll PM you immediately.
Reply all
Reply to author
Forward
0 new messages