VSEARCH clustering fails for mutations in the beginning of the sequence

222 views
Skip to first unread message

Itai Rusinek

unread,
Aug 8, 2018, 12:19:59 PM8/8/18
to VSEARCH Forum
Hi,

I've come across a strange problem with VSEARCH clustering. When I introduce a single mutation in a sequence, it only succeeds in clustering it to the correct seed when that mutation is at the end of the sequence, but not in the beginning. To be more precise, I used sequences of length 51, and from the tests I've done point mutations in bases 1-13 are not correctly clustered to their seed, but point mutations in bases 14-51 are. I ran USEARCH for comparison on the same file and it didn't fail.

It's such a strange problem that I imagine it's likely that I am missing something in the parameter definitions, but couldn't find what it was. I tried all 3 clustering options (smallmem, fast, size) and all qmask options. I also tinkered with the id threshold. None of those seemed to matter.

Below is a reproduction of the error. The input file is one cluster seed and 3 sequences with one point mutation each; they should all be clustered together. I highlighted in bold important lines and in red my remarks.

I'd appreciate any help. If I can't make sense of it I guess I would have to pay for USEARCH.

Thanks,
Itai



itai@calc:/tmp/itai$ cat vsearch_input.fasta
>;size=10;
ATTGATTGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA  ***seed
>;size=2;
ATTGATTGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC 
***point mutation
>;size=1;
ATTGATTGAAAAGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA  ***point mutation
>;size=1;
ATTGATTGAAAAAGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA  ***point mutation
>;size=1;
ATTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATTTAAAAAA
itai@calc:/tmp/itai$ /home/itai/src/vsearch/bin/vsearch --cluster_smallmem vsearch_input.fasta --sizeout --sizeorder --id 0.9 --iddef 1 --uc vsearch.uc --consout vsearch.consout --profile vsearch.pwm --sizein --usersort --qmask dust
vsearch v2.6.2_linux_x86_64, 251.8GB RAM, 56 cores
https://github.com/torognes/vsearch

Reading file vsearch_input.fasta 100%  
255 nt in 5 seqs, min 51, max 51, avg 51
Masking 100%
Counting k-mers 100%
Clustering 100%  
Sorting clusters 100%
Writing clusters 100%
Clusters: 3 Size min 1, max 13, avg 1.7
Singletons: 2, 40.0% of seqs, 66.7% of clusters
Multiple alignments 100%
itai@calc:/tmp/itai$ cat vsearch.uc
S    0    51    *    *    *    *    *    ;size=10;    *
H    0    51    98.0    +    0    0    51M    ;size=2;    ;size=10;
S    1    51    *    *    *    *    *    ;size=1;    *                                                  
******************************************************* why is this a seed? it should be part of the cluster above!
H    0    51    98.0    +    0    0    51M    ;size=1;    ;size=10;
S    2    51    *    *    *    *    *    ;size=1;    *
C    0    13    *    *    *    *    *    ;size=10;    *
C    1    1    *    *    *    *    *    ;size=1;    *
C    2    1    *    *    *    *    *    ;size=1;    *
itai@calc:/tmp/itai$ cat vsearch.consout
>centroid=;seqs=3;size=13;
ATTGATTGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>centroid=;seqs=1;size=1;
ATTGATTGAAAAGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA ******************************************************* why is this a centroid? it should be part of the cluster above!
>centroid=;seqs=1;size=1;
ATTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATTTAAAAA


itai@calc:/tmp/itai$ /home/itai/src/usearch/usearch10.0.240_i86linux32 --cluster_fast vsearch_input.fasta --sizeout --id 0.9 --uc usearch.uc --consout usearch.consout --sizein --qmask soft
usearch v10.0.240_i86linux32, 4.0Gb RAM (264Gb total), 56 cores
(C) Copyright 2013-17 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: i...@fx9.tech

00:00 40Mb    100.0% Reading vsearch_input.fasta
00:00 6.4Mb  CPU has 56 cores, defaulting to 10 threads
00:00 91Mb   5 seqs (tot.size 15), 5 uniques, 3 singletons (60.0%)
00:00 91Mb   Min size 1, median 1, max 10, avg 3.00
00:00 95Mb    100.0% DB
00:00 100Mb   100.0% 2 clusters, max size 14, avg 7.5
00:00 100Mb   100.0% Building MSAs                  
                                 
      Seqs  5
  Clusters  2
  Max size  14
  Avg size  7.5
  Min size  1
Singletons  1, 20.0% of seqs, 50.0% of clusters
   Max mem  100Mb
      Time  1.00s
Throughput  5.0 seqs/sec.


WARNING: Option -qmask not used

itai@calc:/tmp/itai$ cat usearch.uc
S    0    51    *    .    *    *    *    ;size=10;    *
H    0    51    98.0    +    0    0    51M    ;size=1;    ;size=10;
S    1    51    *    .    *    *    *    ;size=1;    *
H    0    51    98.0    +    0    0    51M    ;size=2;    ;size=10;
H    0    51    98.0    +    0    0    51M    ;size=1;    ;size=10;
C    0    14    *    *    *    *    *    ;size=10;    *
C    1    1    *    *    *    *    *    ;size=1;    *
itai@calc:/tmp/itai$ cat usearch.consout
>Cluster0;size=14;
ATTGATTGAAAAaaAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAa 
******************************************************* here USEARCH succeeded where VSEARCH failed
>Cluster1;size=1;
ATTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATTTAAAAAA


Torbjørn Rognes

unread,
Aug 15, 2018, 10:50:38 AM8/15/18
to VSEARCH Forum
Hi Itai

Thank you for reporting this issue.

You have identified a problem with the heuristics of VSEARCH that appears with short sequences, especially when much of the sequences are repetitive. It can be circumvented by adjusting some options, but VSEARCH should perhaps be improved to avoid this problem.

It is a bit complex, but I will try to explain why it happens.

When VSEARCH is searching or clustering it initially identifies the most promising hit candidates based on shared k-mers between the sequences. By default the k-mers are 8 bp long. The k-mers selected for this purpose are the distinct k-mers found in the non-masked regions of the sequences. By default at least 12 k-mers have to be shared for a hit to be identified. If there are fewer than 12 distinct k-mers in the query sequence, all k-mers in the query needs to match. These are the heuristics used that generally work well, but not in this case.

If we mask the input sequences using the dust algorithm we get the following (lowercase letters are masked):

$ vsearch --fastx_mask vsearch_input.fasta --fastaout - -quiet
>a;size=10;
ATTGATTGaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
>b;size=2;
ATTGATTGaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaC
>c;size=1;
ATTGATTGAAAAGaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
>d;size=1;
ATTGATTGaaaaagaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
>e;size=1;
ATTaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaatttaaaaaa

(I have named the sequences a, b, c, d, and e for easier reference.)

Much of the sequences is masked due to the repeated A’s.

The distinct 8-mers found in the non-masked regions are:
a) ATTGATTG
b) ATTGATTG
c) ATTGATTG, TTGATTGA, TGATTGAA, GATTGAAA, ATTGAAAA, TTGAAAAG
d) ATTGATTG
e) -

When clustering, the first sequence (a) immediately forms the centroid of the first cluster.

The second sequence (b) is then searched against the centroids (only a) based on the k-mers. Since there is one common k-mer a hit is identified. Sequence b is then aligned with sequence a and it is identified as more than 90% similar and therefore clustered with a.

The problem appears with sequence c. It has 6 distinct k-mers, but only one of them is shared with a. That is considered insufficient and no hit is identified. A new cluster is formed with sequence c as the centroid.

Sequence d shares one k-mer with sequence a and is clustered with a.

Sequence e has no distinct non-masked k-mers. In this case it is aligned to all the centroids (a and c), but none are 90% similar and e therefore forms the centroid of a new cluster.

One can avoid this problem by setting the vsearch option minwordmatches to 0 when dealing with short sequences or with sequences with lots of low complexity (masked) regions. It will effectively turn off the k-mer-based heuristics. It will make vsearch slower, but if the sequences are few and short, it may not matter so much.

VSEARCH should probably be modified so that fewer k-mer matches are required for query sequences with few distinct k-mers.

I do not know exactly how this is handled in USEARCH as it is not documented in detail.

I have added this as an issue on GitHub.

- Torbjørn

Reply all
Reply to author
Forward
0 new messages