why not 100% matching query sequences

Peter Pellitier

Mar 2, 2021, 3:57:38 PM

to VSEARCH Forum

Hi vsearch,

I have clustered sequences into 97% OTUs (centroids derived from the 'cluster_size' command), and then used usearch_global to 'map' all sequences against this 'representative sequence' database, also at 97%. About 99.95% of my query sequences match the database. However, I imagine it could/should be 100%. If an OTU is formed in the first step and only represents itself as a singleton, why shouldn't the same sequence match it perfectly, at 100%, in the usearch_global step? In that way, 100% of the query sequences should hit. I am failing to understand something key here. Some of the sequences were removed due to short length; perhaps this is contributing? Thanks for the help, and for any advice on how to tweak the commands.

Peter

Torbjørn Rognes

Mar 4, 2021, 4:29:25 AM

to VSEARCH Forum

Hi Peter,

This happens because the usearch_global (and cluster_size) algorithm is a heuristic algorithm, not an exact algorithm. It may fail to find sequences that are 97% identical to the query. Therefore, a small fraction of the sequences will not match; that's normal.

To make them fast, these algorithms take some shortcuts. When searching or clustering, they rank the database sequences by the number of 8-mers each one shares with the query and consider those with the most shared 8-mers first. Only a limited number of these candidates are examined in detail, that is, fully aligned with an exhaustive alignment algorithm, because alignment takes a long time.

You can adjust the number of candidates examined in detail with the options "-maxrejects" and "-maxaccepts". The defaults are 32 and 1, respectively: for each query, vsearch stops when 32 candidates have been examined and rejected (below 97%) or when 1 matching candidate (at least 97%) has been found. If you increase "-maxrejects" to 100 or 1000, the chance of missing a match is much lower. You can even set it to 0 to make vsearch consider all sequences until it finds a match.

Because the search also stops at the first database sequence that matches (at least 97%), there might be others that match as well or better. To make vsearch consider more sequences you can increase "-maxaccepts" as well, but then you should perhaps also consider using the "-maxhits" option to limit the number of hits reported. If you increase "-maxaccepts" and "-maxrejects", the program will slow down. The same algorithms are used in usearch.

What usually happens is that there are many other sequences that are around 96% similar to those query sequences and only one that is a little above 97% similar. Due to the heuristic, vsearch will consider 32 of the sequences that are below 97% first and won't find the sequence that is just above 97% similar. The heuristic based on common 8-mers is not precise enough to order these sequences correctly.
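The failure mode described above can be reproduced with a toy model. The following is only a minimal Python sketch, not vsearch's actual implementation: the names `search`, `mutate`, `kmers` and `identity` are hypothetical, identity is computed position by position on equal-length sequences instead of by a full alignment, and shared 8-mers are counted as distinct words. It shows how decoys with clustered mismatches (96% identical) can share more 8-mers with the query than a true match with spread-out mismatches (97% identical), so a small reject limit gives up before the true match is examined.

```python
import random

K = 8  # word length of the pre-filter, as described in the post


def kmers(seq, k=K):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def identity(a, b):
    """Fraction of matching positions (toy stand-in for a real alignment)."""
    assert len(a) == len(b)
    return sum(x == y for x, y in zip(a, b)) / len(a)


def search(query, db, min_id=0.97, maxaccepts=1, maxrejects=32):
    """Toy candidate loop; in this sketch a limit of 0 means 'no limit'."""
    qk = kmers(query)
    # Rank database sequences by number of shared k-mers, highest first.
    ranked = sorted(db, key=lambda name: len(qk & kmers(db[name])), reverse=True)
    hits, rejects = [], 0
    for name in ranked:
        if identity(query, db[name]) >= min_id:
            hits.append(name)
            if maxaccepts and len(hits) >= maxaccepts:
                break
        else:
            rejects += 1
            if maxrejects and rejects >= maxrejects:
                break
    return hits


def mutate(seq, positions):
    """Replace the base at each given position with a different base."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    bases = list(seq)
    for p in positions:
        bases[p] = swap[bases[p]]
    return "".join(bases)


rng = random.Random(1)
query = "".join(rng.choice("ACGT") for _ in range(100))

# One true match: 3 mismatches spread out (97%), which destroys many 8-mers.
db = {"true_match": mutate(query, [20, 50, 80])}
# Five decoys: 4 adjacent mismatches each (96%), which destroys few 8-mers.
for start in (10, 30, 45, 60, 75):
    db[f"decoy_{start}"] = mutate(query, range(start, start + 4))

print(search(query, db, maxrejects=3))  # decoys rank first, search gives up
print(search(query, db, maxrejects=0))  # no reject limit, true match is found
```

In this toy database the reject limit is set to 3 rather than 32 only to keep the example small; the same effect occurs with the default when more than 32 near-miss sequences outrank the true match.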

So, if you want to make searching and clustering more accurate and reduce this problem, you could try to increase maxrejects to e.g. 100 or 1000.
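As a rough illustration of that suggestion, here is a small Python sketch that assembles a usearch_global command line with a higher reject limit. The file names (`all_seqs.fasta`, `centroids.fasta`, `map.uc`) and the helper `usearch_global_cmd` are hypothetical placeholders; the options shown are the ones discussed in this thread plus the usual `--db`, `--id` and `--uc` arguments, so check them against the manual for your vsearch version before running anything.

```python
import shlex


def usearch_global_cmd(queries, db, out_uc, identity=0.97,
                       maxaccepts=1, maxrejects=1000):
    """Build (but do not run) a vsearch mapping command as an argument list."""
    return [
        "vsearch",
        "--usearch_global", queries,     # query sequences to map
        "--db", db,                      # OTU centroids used as the database
        "--id", str(identity),           # minimum identity for a match
        "--maxaccepts", str(maxaccepts),
        "--maxrejects", str(maxrejects), # raised from the default of 32
        "--uc", out_uc,                  # mapping results in uc format
    ]


cmd = usearch_global_cmd("all_seqs.fasta", "centroids.fasta", "map.uc")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Keeping the command as a list makes it easy to compare runs with different reject limits (for example 32, 100 and 1000) and see how the fraction of unmatched queries changes.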

I hope this was understandable.

- Torbjørn

Peter Pellitier

Mar 4, 2021, 2:16:49 PM

to VSEARCH Forum

Torbjørn,

Thank you very much for this informative response. I appreciate your replies and the overall package. It allowed me to cluster more than 1.5 million ASVs into OTUs in less than 2 hours.

Peter
