Hi Peter,
This happens because the usearch_global and cluster_size algorithms are heuristic, not exact. They may fail to find sequences that are 97% identical to the query, so a small fraction of the sequences will not match; that's normal.
To be fast, these algorithms take some shortcuts. When searching or clustering, they rank the potentially matching database sequences by the number of 8-mers they share with the query and start with those that share the most. Only a limited number of these candidates are examined in detail, that is, fully aligned with an exhaustive alignment algorithm, because that step takes a long time.
You can adjust the number of sequences examined in detail with the options "-maxrejects" and "-maxaccepts". The defaults are 32 and 1, respectively: vsearch stops when 32 candidates have been examined that do not match (at least 97%), or when 1 matching sequence has been found. If you increase "-maxrejects" to 100 or 1000, the chance of missing a match is much lower. You can even set it to 0 to make vsearch consider all sequences until it finds a match.
Because "-maxaccepts" is 1 by default, vsearch also stops at the first database sequence that matches (at least 97%), even though there might be others that match as well or better. To make vsearch consider more sequences you can increase "-maxaccepts" too, but then you should perhaps also use the "-maxhits" option to limit the number of hits reported. Increasing "-maxaccepts" and "-maxrejects" will slow the program down. The same algorithms are used in usearch.
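To illustrate the procedure, here is a minimal Python sketch. It is not the actual vsearch code: the function names are made up for this example, the way shared 8-mers are counted (distinct words) is an assumption, and the "identity" function is only a stand-in for the real full alignment (it simply compares equal-length sequences position by position). A limit of 0 is treated as "no limit".

```python
K = 8  # word length mentioned in the text


def kmers(seq, k=K):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmers(query, target, k=K):
    """Number of distinct k-mers present in both sequences (assumed counting rule)."""
    return len(kmers(query, k) & kmers(target, k))


def identity(a, b):
    """Stand-in for the exhaustive alignment: fraction of equal positions."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def search(query, database, min_id=0.97, maxaccepts=1, maxrejects=32):
    """Examine candidates in order of shared k-mers until a limit is reached."""
    # Rank database sequences by shared k-mers, highest first.
    ranked = sorted(database,
                    key=lambda name: shared_kmers(query, database[name]),
                    reverse=True)
    hits, rejects = [], 0
    for name in ranked:
        ident = identity(query, database[name])  # the expensive step
        if ident >= min_id:
            hits.append((name, ident))
            if maxaccepts and len(hits) >= maxaccepts:
                break  # enough matches found
        else:
            rejects += 1
            if maxrejects and rejects >= maxrejects:
                break  # too many non-matching candidates, give up
    return hits


query = "ACGT" * 25
database = {"identical": query, "unrelated": "T" * 100}
print(search(query, database))  # [('identical', 1.0)]
```

The important point is that the loop can end before every database sequence has been aligned, so a match that ranks low on shared 8-mers may never be looked at.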
What usually happens is that there are many other sequences that are around 96% similar to those query sequences and only one that is a little above 97% similar. Due to the heuristics, vsearch will consider 32 of the sequences that are below 97% first and won't find the sequence that is just above 97%. The heuristic based on common 8-mers is not precise enough to order these sequences correctly.
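A small simulation of this situation, again only a sketch: the candidate list, the 8-mer counts and the identities are invented numbers, and "first_hit" is a hypothetical helper, not a vsearch function. It assumes 40 near misses at 96% identity that happen to share slightly more 8-mers with the query than the single true match at 97.2%.

```python
def first_hit(candidates, min_id=0.97, maxrejects=32):
    """candidates: list of (name, shared_8mers, identity) tuples.

    Returns the name of the first accepted candidate, or None if the
    reject limit is reached first. maxrejects=0 means no limit.
    """
    # Examine candidates in order of shared 8-mers, highest first.
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    rejects = 0
    for name, _, ident in ranked:
        if ident >= min_id:
            return name
        rejects += 1
        if maxrejects and rejects >= maxrejects:
            return None
    return None


# Invented example data: the near misses rank ahead of the true match.
near_misses = [(f"near_miss_{i}", 200, 0.96) for i in range(40)]
true_match = ("true_match", 199, 0.972)
candidates = near_misses + [true_match]

print(first_hit(candidates, maxrejects=32))   # None
print(first_hit(candidates, maxrejects=100))  # true_match
print(first_hit(candidates, maxrejects=0))    # true_match
```

With the default limit the search gives up after 32 near misses; with a higher limit, or no limit, it gets far enough down the list to reach the true match.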
So, if you want to make searching and clustering more accurate and reduce this problem, you could try increasing "-maxrejects" to, e.g., 100 or 1000.
I hope this was understandable.
- Torbjørn