pick_de_novo_otus does not ensure ID%

Constantino Schillebeeckx

unread,

Sep 14, 2016, 10:40:39 AM9/14/16

to Qiime 1 Forum

I'm running pick_de_novo_otus.py on a set of sequences (with uclust) at -o 97; when I check my results, I can find several OTUs that have an ID% >97 when blasted against each other. I've tried adjusting the uclust settings several times to:

max_accepts 50, max_rejects 0
max_accepts 200, max_rejects 0
exact_uclust True (uclust "exact" clustering)

Yet for each of the above settings, I'm still finding OTUs with ID% > 97.

I thought perhaps this could be alleviated by adjusting the word_length option

pick_otus.py -i gg_13_5_pynast_left_2264_right_4052.fasta -o 97/uclust_picked_otus --max_rejects 0 --max_accepts 50 --word_length 30 --similarity 0.97

however I then get a runtime error:

Traceback (most recent call last):
  File "/usr/local/bin/pick_otus.py", line 1004, in <module>
    main()
  File "/usr/local/bin/pick_otus.py", line 792, in main
    result_path=result_path, log_path=log_path, HALT_EXEC=False)
  File "/usr/local/lib/python2.7/dist-packages/qiime/pick_otus.py", line 1286, in __call__
    HALT_EXEC=HALT_EXEC)
  File "/usr/local/lib/python2.7/dist-packages/bfillings/uclust.py", line 585, in get_clusters_from_fasta_filepath
    raise ApplicationError('Error running uclust. Possible causes are '
burrito.util.ApplicationError: Error running uclust. Possible causes are unsupported version (current supported version is v1.2.22) is installed or improperly formatted input file was provided

Is there any way to guarantee that the OTUs picked with of pick_de_novo_otus are all ID% < 97?

Colin Brislawn

unread,

Sep 14, 2016, 1:15:33 PM9/14/16

to Qiime 1 Forum

Hello Constantino,

As you probably already know, uclust uses k-mer heuristics to quickly find top hits and a seed-and-extend algorithm to produce alignments. Both of these are heuristics, and are not perfect by definition. It's the seed-and-extend alignment step which causes things to be slightly off from your 97% threshold.

To solve this issue, you need to use optional alignments (you could call these 'exact alignments' or 'non-heuristics alignments'). You can do this by passing the -fulldp option to usearch, which uses the exact 'full dynamic programing'. This option is called --nofastalign on old version of uclust.

http://drive5.com/usearch/manual/aln_heuristics.html

http://www.drive5.com/uclust/uclust_userguide_1_1_579.html

Vsearch always uses exact alignments. So you could just use vsearch :-)

Colin

Constantino Schillebeeckx

unread,

Sep 19, 2016, 10:42:07 AM9/19/16

to qiime...@googlegroups.com

I was hoping you'd chime in Colin :)

That's exactly what I was looking for. For completeness I ran both versions:

vsearch --fasta_width 0 --cluster_size reads.fasta --clusterout_id --centroids vsearch_centroids.fna --id 0.97 --qmask none --notrunclabels --iddef 4 --uc vsearch_centroids.uc
 
uclust --input reads.fasta --nofastalign --uc usearch.uc --id 0.97 --usersort --fastapairs usearch.fna

Colin Brislawn

unread,

Sep 19, 2016, 10:46:41 AM9/19/16

to qiime...@googlegroups.com

Nice! How do they compare (in terms of speed, number of hits, etc)? Did they get the same hits?

Thanks!

Colin

Constantino Schillebeeckx

unread,

Sep 19, 2016, 10:51:39 AM9/19/16

to Qiime 1 Forum

Good questions!

vsearch was notably faster; the results however, are not the same...

I started with 1,261,500 input reads which were clustered down to:

73,441 reads with usearch
70,071 reads with vsearch

I'm currently investigating the differences and will report back.

Colin Brislawn

unread,

Sep 19, 2016, 11:05:41 AM9/19/16

to Qiime 1 Forum

Very cool. Let me know what you find.

Our conversation reminded me of a vsearch issue, where the dev acknowledges that benchmarks like this are hard. He reminds us to use the same making algorythm in both programs (they are diff by default) and that in both outputs, "The correct number of clusters is probably even lower."
https://github.com/torognes/vsearch/issues/85

Keep in touch,

Colin

Constantino Schillebeeckx

unread,

Sep 21, 2016, 4:47:47 PM9/21/16

to qiime...@googlegroups.com

Ok, let's see if I can detail what I've found in a concise way.

The first step was ensuring that de novo OTU picking was done properly and equitably between usearch and vsearch. To do that, I ended up using usearch v. 4.1.93 because it was the highest version I could find which allows the user to specify the definition of ID when clustering.

My input reads are a modified version of the Green Genes (May, 2013) database. Before doing OTU picking on these reads, I trimmed them to the V4 region, de-replicated them, removed any reads with ambiguous base calls (more details below) and then sorted by decreasing read length. This gives me 317,479 input reads.

My OTU picking commands for usearch and vsearch (respectively) are:

usearch4.1.93 --cluster /home/data_repo/pre_processing/otu_support_files/gg_13_5/gg_13_5_pynast_left_2264_right_4052_derep_no_ambig_sort.fasta --id 0.97 --nofastalign --uc usearch.uc --iddef 4

vsearch --fasta_width 0 --cluster_size /home/data_repo/pre_processing/otu_support_files/gg_13_5/gg_13_5_pynast_left_2264_right_4052_derep_no_ambig_sort.fasta --centroids vsearch.fna --id 0.97 --qmask none --notrunclabels --iddef 4 --uc vsearch.uc

You'll notice in both cases I'm specifying iddef 4 (BLAST definition of ID) and I've specified --nofastalign for usearch which turns off the heuristic filtering. This, by definition, is always turned off in vsearch.

Runtime

Commands run on 10 core server with 328GB of ram (Ubuntu)

usearch - 8:40
vsearch - 1:13

Runtimes were substantially longer with usearch (also as noted in the documentation)

USEARCH by default uses a heuristic procedure involving seeding, extension and banded dynamic programming. If the --fulldp option is specified to USEARCH it will also use a full dynamic programming approach, but USEARCH is then considerably slower.

Output

usearch - 78,376 clusters
vsearch - 66,307 clusters
57,458 clusters common to both methods (common = had the same ID for the cluster)
20,918 unique to usearch
8,849 unique to vsearch

Looking at the distribution of cluster sizes, we see that vsearch tends to generate larger clusters (clusters with more binned reads in them). It should also be noted that the largest cluster for vsearch had 2779 reads in it; whereas the largest for usearch is 486 reads). So, it looks like vsearch is better able to pick out centroids that are within 97%ID of a larger number of reads than is usearch.

It should be noted at this point that, although iddef 4 was specified for both clustering methods, it seems like behind the scenes they aren't 100% equal. For example, in the cases of wildcard bases (e.g. N) usearch doesn't count these as miss-matches - however it looks like vsearch does. Therefore, if you're reads (query or subject) contain degenerate bases, one source of different clustering bins comes from the fact that the %ID match can be quite different between the two methods. (I've seen cases where usearch calculated 98%ID but vsearch calculated it as 92%ID). This is why I removed any reads with degenerate bases when modifying the green genes database.

Next, I investigated (by "hand") some of the unique clusters. For example, vsearch identified a cluster with centroid 1081427 (ID from the ref DB); on the usearch side, this read was lumped into the cluster with centroid 4317881. When I check which reads were lumped into the vsearch cluster (with centroid 1081427) I find the ID 4317881. When I blast these two reads together (1081427 & 4317881) I see they have 97.25% ID. So it appears in this case, that potentially due to a different sorting order, different seeds were chosen for essentially the same cluster.

When I check, for every unique vsearch cluster, the ID match to the read contained in that cluster that was chosen as the centroid on the usearch side, I find that they are all >97%ID. The same goes for the other way around (starting with those centroids that are unique to usearch and checking the vsearch counterpart).

Reply all

Reply to author

Forward

pick_de_novo_otus does not ensure ID% < 97

Constantino Schillebeeckx

Colin Brislawn

Constantino Schillebeeckx

Colin Brislawn

Constantino Schillebeeckx

Colin Brislawn

Constantino Schillebeeckx