Ok, let's see if I can detail what I've found in a concise way.
The first step was ensuring that de novo OTU picking was done properly and equitably between usearch and vsearch. To do that, I ended up using usearch v. 4.1.93 because it was the highest version I could find which allows the user to specify the definition of ID when clustering.
My input reads are a modified version of the Greengenes (May, 2013) database. Before doing OTU picking on these reads, I trimmed them to the V4 region, de-replicated them, removed any reads with ambiguous base calls (more details below) and then sorted them by decreasing read length. This gives me 317,479 input reads.
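The de-replication, filtering and sorting steps above could be sketched in Python roughly as follows. This is not the script actually used for this analysis: the function names `read_fasta` and `prepare_reads` are made up for illustration, and the sketch assumes the reads are already trimmed to V4, contain no gap characters, and use the DNA alphabet A, C, G, T.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (label, sequence) pairs from FASTA-formatted lines."""
    label, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if label is not None:
                yield label, "".join(chunks)
            label, chunks = line[1:], []
        else:
            chunks.append(line)
    if label is not None:
        yield label, "".join(chunks)


def prepare_reads(records: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Drop ambiguous reads, de-replicate, and sort by decreasing length."""
    seen = set()
    kept = []
    for label, seq in records:
        seq = seq.upper()
        # Assumption: anything other than A, C, G, T counts as ambiguous.
        if set(seq) - set("ACGT"):
            continue
        # Keep only the first occurrence of each exact sequence.
        if seq in seen:
            continue
        seen.add(seq)
        kept.append((label, seq))
    # Python's sort is stable, so equal-length reads keep their input order.
    kept.sort(key=lambda record: len(record[1]), reverse=True)
    return kept
```

Note that ties between reads of equal length keep their input order here; a different tie-breaking rule would change the order in which the clustering tools see the reads.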
My OTU picking commands for usearch and vsearch (respectively) are:
usearch4.1.93 --cluster /home/data_repo/pre_processing/otu_support_files/gg_13_5/gg_13_5_pynast_left_2264_right_4052_derep_no_ambig_sort.fasta --id 0.97 --nofastalign --uc usearch.uc --iddef 4
vsearch --fasta_width 0 --cluster_size /home/data_repo/pre_processing/otu_support_files/gg_13_5/gg_13_5_pynast_left_2264_right_4052_derep_no_ambig_sort.fasta --centroids vsearch.fna --id 0.97 --qmask none --notrunclabels --iddef 4 --uc vsearch.uc
You'll notice that in both cases I'm specifying --iddef 4 (BLAST definition of ID), and that I've specified --nofastalign for usearch, which turns off the heuristic filtering. That filtering is, by definition, always turned off in vsearch.
Runtime
Commands were run on a 10-core server with 328 GB of RAM (Ubuntu)
- usearch - 8:40
- vsearch - 1:13
Runtimes were substantially longer with usearch (as also noted in the documentation):
"USEARCH by default uses a heuristic procedure involving seeding, extension and banded dynamic programming. If the --fulldp option is specified to USEARCH it will also use a full dynamic programming approach, but USEARCH is then considerably slower."
Output
- usearch - 78,376 clusters
- vsearch - 66,307 clusters
- 57,458 clusters common to both methods (common = had the same centroid ID for the cluster)
- 20,918 unique to usearch
- 8,849 unique to vsearch
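One way to obtain counts like these is to compare the centroid labels in the two .uc files. The sketch below is an illustration, not the script used here; `centroid_labels` and `compare_centroids` are hypothetical names. It assumes the tab-separated .uc layout in which `S` records mark a new centroid and carry its label in the 9th column, and that the two tools write labels that can be compared verbatim.

```python
from typing import Dict, Iterable, Set


def centroid_labels(uc_lines: Iterable[str]) -> Set[str]:
    """Collect the labels of all centroid (S) records in a .uc file."""
    labels = set()
    for line in uc_lines:
        # Skip blank lines and '#' comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":
            labels.add(fields[8])
    return labels


def compare_centroids(first: Iterable[str], second: Iterable[str]) -> Dict[str, int]:
    """Count centroids shared between two .uc files and unique to each."""
    a = centroid_labels(first)
    b = centroid_labels(second)
    return {
        "common": len(a & b),
        "only_first": len(a - b),
        "only_second": len(b - a),
    }


# Example use with the output files named in the commands above:
# with open("usearch.uc") as u, open("vsearch.uc") as v:
#     print(compare_centroids(u, v))
```

As a sanity check, "common" plus each "only" count should add back up to the total number of clusters reported for that tool.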
![](https://lh3.googleusercontent.com/-96oGWyUduUg/V-LyNjMAufI/AAAAAAAAAvA/_0JD7FuM75kcEIwtXUr60wZOG92V4McsQCLcB/s1600/download.png)
Looking at the distribution of cluster sizes, we see that vsearch tends to generate larger clusters (clusters with more reads binned into them). It should also be noted that the largest cluster for vsearch had 2779 reads in it, whereas the largest for usearch had 486 reads. So, it looks like vsearch is better able than usearch to pick out centroids that are within 97%ID of a larger number of reads.
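A cluster size distribution like the one plotted above can be tallied directly from a .uc file. The following is a rough sketch rather than the code behind the figure; `cluster_sizes` and `size_histogram` are illustrative names. It assumes `S` records carry the centroid label in the 9th column and `H` records carry the centroid they hit in the 10th column.

```python
from collections import Counter
from typing import Iterable


def cluster_sizes(uc_lines: Iterable[str]) -> Counter:
    """Return a Counter mapping centroid label -> number of reads in its cluster."""
    sizes = Counter()
    for line in uc_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":
            # The centroid itself counts as one member of its cluster.
            sizes[fields[8]] += 1
        elif fields[0] == "H":
            # A hit is counted toward the centroid it was assigned to.
            sizes[fields[9]] += 1
    return sizes


def size_histogram(sizes: Counter) -> Counter:
    """Return a Counter mapping cluster size -> number of clusters of that size."""
    return Counter(sizes.values())
```

Because the input was de-replicated, "reads" here means unique sequences; `max(sizes.values())` gives the size of the largest cluster.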
It should be noted at this point that, although iddef 4 was specified for both clustering methods, it seems like behind the scenes they aren't 100% equal. For example, in the case of wildcard bases (e.g. N), usearch doesn't count these as mismatches, but it looks like vsearch does. Therefore, if your reads (query or subject) contain degenerate bases, one source of different clustering bins is that the %ID match can be quite different between the two methods. (I've seen cases where usearch calculated 98%ID but vsearch calculated 92%ID.) This is why I removed any reads with degenerate bases when modifying the Greengenes database.
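To illustrate how much the treatment of N can move the number, here is a toy calculation on an already-aligned pair. It is not the scoring code of either tool; `percent_identity` and its `n_is_wildcard` switch are hypothetical. It assumes identity is computed as matching columns divided by alignment length, and models only the two behaviours described above: N matches anything, or any column containing N counts as a mismatch.

```python
def percent_identity(aligned_a: str, aligned_b: str, n_is_wildcard: bool) -> float:
    """Percent identity = matching columns / alignment length, for two aligned strings."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    columns = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" and y == "-":
            continue  # gap in both sequences: not a real alignment column
        columns += 1
        if x == "-" or y == "-":
            continue  # gap column adds to the length but is never a match
        if "N" in (x, y):
            # Assumption: the only difference modelled is how N is scored.
            if n_is_wildcard:
                matches += 1
        elif x == y:
            matches += 1
    if columns == 0:
        raise ValueError("alignment has no columns")
    return 100.0 * matches / columns


a = "ACGTNNACGT"
b = "ACGTACACGT"
print(percent_identity(a, b, n_is_wildcard=True))   # N treated as matching
print(percent_identity(a, b, n_is_wildcard=False))  # N treated as a mismatch
```

With two Ns in a ten-column alignment the two treatments already differ by 20 percentage points, so a handful of degenerate bases in a short V4 read is enough to push a pair across the 97% threshold in one tool but not the other.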
Next, I investigated (by "hand") some of the unique clusters. For example, vsearch identified a cluster with centroid 1081427 (ID from the ref DB); on the usearch side, this read was lumped into the cluster with centroid 4317881. When I check which reads were lumped into the vsearch cluster (with centroid 1081427), I find the ID 4317881. When I BLAST these two reads against each other (1081427 & 4317881), I see they have 97.25% ID. So it appears that in this case, potentially due to a different sorting order, different seeds were chosen for essentially the same cluster.
For every unique vsearch cluster, I then checked the ID match between its centroid and the read contained in that cluster that was chosen as the centroid on the usearch side; they are all >97%ID. The same goes for the other way around (starting with the centroids that are unique to usearch and checking the vsearch counterpart).
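A rough way to automate part of this check is to look up, for each centroid unique to one tool, which centroid the other tool assigned that read to and at what identity. The sketch below is an approximation of the manual procedure, not the method actually used: it reads the identity the other clusterer recorded in its own `H` records (4th column) instead of running an independent BLAST comparison, and the names `parse_uc` and `counterparts` are made up for illustration.

```python
from typing import Dict, Iterable, List, Set, Tuple


def parse_uc(uc_lines: Iterable[str]) -> Tuple[Set[str], Dict[str, Tuple[str, float]]]:
    """Return (centroid labels, {read label: (assigned centroid, percent identity)})."""
    centroids: Set[str] = set()
    assigned: Dict[str, Tuple[str, float]] = {}
    for line in uc_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":
            centroids.add(fields[8])
        elif fields[0] == "H":
            # Query label in column 9, centroid it hit in column 10.
            assigned[fields[8]] = (fields[9], float(fields[3]))
    return centroids, assigned


def counterparts(own_uc: Iterable[str], other_uc: Iterable[str]) -> List[Tuple[str, str, float]]:
    """For each centroid unique to own_uc, report where other_uc placed that read."""
    own_centroids, _ = parse_uc(own_uc)
    other_centroids, other_assigned = parse_uc(other_uc)
    rows = []
    for label in sorted(own_centroids - other_centroids):
        if label in other_assigned:
            other_centroid, pct_id = other_assigned[label]
            rows.append((label, other_centroid, pct_id))
    return rows
```

Calling `counterparts` once with the vsearch file first and once with the usearch file first covers both directions; any row with an identity below the clustering threshold would be worth inspecting by hand.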