usearch_global not matching even at lower identity

Kevin Wamae

unread,

May 3, 2021, 2:28:25 AM5/3/21

to VSEARCH Forum

Hi, I've got two primers (db1.fasta) that I'm trying to match to genomic DNA (db2.fasta).

primer1 is 100% identical to the genomic DNA while primer2 has a 5 bp difference so I set my --id to 0.5 to see if I can get a match and somehow it doesn't get a hit.

I keep getting "Matching unique query sequences: 1 of 2 (50.00%)"

Below is my script:

vsearch \

--usearch_global db1.fasta \

--db db2.fasta \

--strand both \

--id 0.5 \

--minseqlength 1 \

--alnout alnout.txt \

--maxaccepts 100 \

--maxrejects 100

>primer1

GGTGATATTGTAAGAGGAA

>primer2

GGAGATATTATACGCGGCA

>gDNA

GCAAAACGTGAAGGAGAATCTATTGTGAATAGTCAAGCAAATAATGGAACGTTAAATGTA

TGTACTGCACTTGCACGAAGTTTTGCAGATATAGGTGATATTGTAAGAGGAACAGATCTT

TTCCTTGGTGGTCCTAGTCAAGAGAAAAAAAAATTAGAAGAAAATCTGAAAAAAATATTT

Colin Brislawn

unread,

May 3, 2021, 3:59:19 PM5/3/21

to VSEARCH Forum

Hello Kevin,

I'm not one of the vsearch devs, but I think I know why this happened.

From the manual:

--id real

Reject the sequence match if the pairwise identity is lower than real (value ranging from 0.0 to 1.0 included). The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. That efficient pre-filtering also prevents pairwise alignments with weakly matching targets, as there needs to be at least 6 shared kmers to start the pairwise alignment, and at least one out of every 16 k-mers from the query needs to match the target. Consequently, using values lower than --id 0.5 is not likely to capture more weakly matching targets. The pairwise identity is by default defined as the number of (matching columns) / (alignment length - terminal gaps). That definition can be modified by --iddef.

primer2 GGAGATATTATACGCGGCA

gDNA subseq GGTGATATTGTAAGAGGAA

mismatches ..T......G..A.A..A.

Because there are so many edits in such a short sequence, it reduces the number of shared kmers under six, so the full alignment is never performed.

This kmer filter works well for higher identity or longer sequences, but aligning very short sequences with lots of edits is really hard!

You could try using the --cut command for restriction site cutting, but I'm not sure if that would have the same issue with shared kmers.

Colin

Kevin Wamae

unread,

May 3, 2021, 4:23:19 PM5/3/21

to VSEARCH Forum

Much appreciated Colin, your observation pointed me to this flag in the manual:

--minwordmatches: non-negative integer Minimum number of word matches required for a sequence to be considered further. Default value is 12 for the default word length 8. For word lengths 3-15, the default minimum word matches are 18, 17, 16, 15, 14, 12, 11, 10, 9, 8, 7, 5 and 3, respectively. If the query sequence has fewer unique words than the number speciﬁed, all words in the query must match. If the argument is 0, no word matches are required.

By lowering this number to 0, I found a match at 73% identity, which is exactly what I was looking for. Thanks again.

Qry 1 + GGAGATATTATACGCGGCA 19

|| |||||| || | || |

Tgt 94 + GGTGATATTGTAAGAGGAA 112

19 cols, 14 ids (73.7%), 0 gaps (0.0%)

Torbjørn Rognes

unread,

May 4, 2021, 8:31:08 AM5/4/21

to VSEARCH Forum

I am happy to see that you solved the issue yourself. With short sequences, it is necessary to adjust the word length down with the "--wordlength" option, and/or adjust the "--minwordmatches" down too, as you did. The default options are targeted at more typical amplicon sequence length in the range of a couple of hundred bases. Perhaps VSEARCH should take the length of the query sequence into consideration and automatically adjust these parameters accordingly.

Thanks to Colin for pointing you in the right direction!

- Torbjørn

Kevin Wamae

unread,

May 4, 2021, 11:48:31 AM5/4/21

to VSEARCH Forum

Thanks Torbjørn for pointing that out.

I think if vsearch would automatically adjust these parameters based on sequence length (like you highlighted) that'd be helpful.

Reply all

Reply to author

Forward