Hi Itai
Thank you for reporting this issue.
You have identified a problem with the heuristics of VSEARCH that appears with short sequences, especially when much of the sequences are repetitive. It can be circumvented by adjusting some options, but VSEARCH should perhaps be improved to avoid this problem.
It is a bit complex, but I will try to explain why it happens.
When VSEARCH is searching or clustering it initially identifies the most promising hit candidates based on shared k-mers between the sequences. By default the k-mers are 8 bp long. The k-mers selected for this purpose are the distinct k-mers found in the non-masked regions of the sequences. By default at least 12 k-mers have to be shared for a hit to be identified. If there are fewer than 12 distinct k-mers in the query sequence, all k-mers in the query needs to match. These are the heuristics used that generally work well, but not in this case.
If we mask the input sequences using the dust algorithm we get the following (lowercase letters are masked):
$ vsearch --fastx_mask vsearch_input.fasta --fastaout - -quiet
>a;size=10;
ATTGATTGaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
>b;size=2;
ATTGATTGaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaC
>c;size=1;
ATTGATTGAAAAGaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
>d;size=1;
ATTGATTGaaaaagaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
>e;size=1;
ATTaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaatttaaaaaa
(I have named the sequences a, b, c, d, and e for easier reference.)
Much of the sequences is masked due to the repeated A’s.
The distinct 8-mers found in the non-masked regions are:
a) ATTGATTG
b) ATTGATTG
c) ATTGATTG, TTGATTGA, TGATTGAA, GATTGAAA, ATTGAAAA, TTGAAAAG
d) ATTGATTG
e) -
When clustering, the first sequence (a) immediately forms the centroid of the first cluster.
The second sequence (b) is then searched against the centroids (only a) based on the k-mers. Since there is one common k-mer a hit is identified. Sequence b is then aligned with sequence a and it is identified as more than 90% similar and therefore clustered with a.
The problem appears with sequence c. It has 6 distinct k-mers, but only one of them is shared with a. That is considered insufficient and no hit is identified. A new cluster is formed with sequence c as the centroid.
Sequence d shares one k-mer with sequence a and is clustered with a.
Sequence e has no distinct non-masked k-mers. In this case it is aligned to all the centroids (a and c), but none are 90% similar and e therefore forms the centroid of a new cluster.
One can avoid this problem by setting the vsearch option minwordmatches to 0 when dealing with short sequences or with sequences with lots of low complexity (masked) regions. It will effectively turn off the k-mer-based heuristics. It will make vsearch slower, but if the sequences are few and short, it may not matter so much.
VSEARCH should probably be modified so that fewer k-mer matches are required for query sequences with few distinct k-mers.
I do not know exactly how this is handled in USEARCH as it is not documented in detail.
I have added this as an issue on GitHub.
- Torbjørn