usearch_global not matching even at lower identity

142 views
Skip to first unread message

Kevin Wamae

unread,
May 3, 2021, 2:28:25 AM5/3/21
to VSEARCH Forum
Hi, I've got two primers (db1.fasta) that I'm trying to match to genomic DNA (db2.fasta).

primer1 is 100% identical to the genomic DNA while primer2 has a 5 bp difference so I set my --id to 0.5 to see if I can get a match and somehow it doesn't get a hit.

I keep getting "Matching unique query sequences: 1 of 2 (50.00%)"

Below is my script:
vsearch \
  --usearch_global db1.fasta \
  --db db2.fasta \
  --strand both \
  --id 0.5 \
  --minseqlength 1 \
  --alnout alnout.txt \
  --maxaccepts 100 \
  --maxrejects 100

>primer1
GGTGATATTGTAAGAGGAA
>primer2
GGAGATATTATACGCGGCA

>gDNA
GCAAAACGTGAAGGAGAATCTATTGTGAATAGTCAAGCAAATAATGGAACGTTAAATGTA
TGTACTGCACTTGCACGAAGTTTTGCAGATATAGGTGATATTGTAAGAGGAACAGATCTT
TTCCTTGGTGGTCCTAGTCAAGAGAAAAAAAAATTAGAAGAAAATCTGAAAAAAATATTT



Colin Brislawn

unread,
May 3, 2021, 3:59:19 PM5/3/21
to VSEARCH Forum
Hello Kevin,

I'm not one of the vsearch devs, but I think I know why this happened.

From the manual:
--id real
Reject the sequence match if the pairwise identity is lower than real (value ranging from 0.0 to 1.0 included). The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. That efficient pre-filtering also prevents pairwise alignments with weakly matching targets, as there needs to be at least 6 shared kmers to start the pairwise alignment, and at least one out of every 16 k-mers from the query needs to match the target. Consequently, using values lower than --id 0.5 is not likely to capture more weakly matching targets. The pairwise identity is by default defined as the number of (matching columns) / (alignment length - terminal gaps). That definition can be modified by --iddef.  

primer2     GGAGATATTATACGCGGCA
gDNA subseq GGTGATATTGTAAGAGGAA
mismatches  ..T......G..A.A..A.

Because there are so many edits in such a short sequence, it reduces the number of shared kmers under six, so the full alignment is never performed.

This kmer filter works well for higher identity or longer sequences, but aligning very short sequences with lots of edits is really hard!

You could try using the --cut command for restriction site cutting, but I'm not sure if that would have the same issue with shared kmers. 

Colin

Kevin Wamae

unread,
May 3, 2021, 4:23:19 PM5/3/21
to VSEARCH Forum
Much appreciated Colin, your observation pointed me to this flag in the manual:

--minwordmatches: non-negative integer Minimum number of word matches required for a sequence to be considered further. Default value is 12 for the default word length 8. For word lengths 3-15, the default minimum word matches are 18, 17, 16, 15, 14, 12, 11, 10, 9, 8, 7, 5 and 3, respectively. If the query sequence has fewer unique words than the number specified, all words in the query must match. If the argument is 0, no word matches are required.

By lowering this number to 0, I found a match at 73% identity, which is exactly what I was looking for. Thanks again.

Qry   1 + GGAGATATTATACGCGGCA 19
          || |||||| || | || |
Tgt  94 + GGTGATATTGTAAGAGGAA 112

19 cols, 14 ids (73.7%), 0 gaps (0.0%)

Torbjørn Rognes

unread,
May 4, 2021, 8:31:08 AM5/4/21
to VSEARCH Forum
I am happy to see that you solved the issue yourself. With short sequences, it is necessary to adjust the word length down with the "--wordlength" option, and/or adjust the "--minwordmatches" down too, as you did. The default options are targeted at more typical amplicon sequence length in the range of a couple of hundred bases. Perhaps VSEARCH should take the length of the query sequence into consideration and automatically adjust these parameters accordingly.

Thanks to Colin for pointing you in the right direction!

- Torbjørn

Kevin Wamae

unread,
May 4, 2021, 11:48:31 AM5/4/21
to VSEARCH Forum
Thanks Torbjørn for pointing that out.

I think if vsearch would automatically adjust these parameters based on sequence length (like you highlighted) that'd be helpful.
Reply all
Reply to author
Forward
0 new messages