vsearch --usearch_global N stretches


gbbio

Feb 10, 2022, 8:22:01 AM
to VSEARCH Forum
Hello,

I am trying to test whether primers match a genome well, and I ran into a problem I don't fully understand. For one primer I got this output:

id=100.0
alnlen=23
mism=0
qrow=nnnnnnnnnnnnnnnnnnnnaGG
trow=GGCATGAACGATACCGATTAAGG

The database contains the primer, so the query is the genome. I think vsearch changed N to n because of dust masking. How can I avoid this kind of match?
I have tried setting --minwordmatches to 0, 1, 2, 3, and 4, but it makes no difference.

Colin J Brislawn

Oct 11, 2023, 2:29:53 PM
to VSEARCH Forum
Vsearch supports --qmask to mask query sequences and --dbmask to mask database sequences.

To disable masking, try running with
--qmask none --dbmask none

P.S. Vsearch ships with a very detailed manual. You can download a copy from the releases page.
Colin

Frédéric Mahé

Oct 30, 2023, 2:07:32 PM
to VSEARCH Forum
I don't think it is possible to filter N-rich cases like the one you describe during the search. However, it is possible to collect data and to identify such cases once the search is done.

When aligning sequences, identical symbols will receive a match score (default +2). Aligning a pair of symbols where at least one of them is an ambiguous symbol (BDHKMNRSVWY) will always result in a score of zero.
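
To make the scoring rule concrete, here is a minimal sketch that recomputes a raw score from a pair of aligned rows (such as qrow and trow). The function name raw_score_ungapped is hypothetical and not part of vsearch. It assumes rows of equal length without gaps, the default match score of +2, and a mismatch score of -4 (which I believe is the vsearch default, please check the manual for your version):

```shell
# Hypothetical helper, not part of vsearch: score two aligned rows.
# Assumptions: same length, no gaps, match = +2, mismatch = -4.
raw_score_ungapped() {
    # $1: query row, $2: target row
    awk -v q="$1" -v t="$2" 'BEGIN {
        q = toupper(q); t = toupper(t)   # masked (lowercase) symbols still count
        ambiguous = "BDHKMNRSVWY"
        score = 0
        for (i = 1; i <= length(q); i++) {
            a = substr(q, i, 1); b = substr(t, i, 1)
            # a column with at least one ambiguous symbol scores zero
            if (index(ambiguous, a) || index(ambiguous, b)) continue
            score += ((a == b) ? 2 : -4)
        }
        print score
    }'
}

# rows from the original post
raw_score_ungapped nnnnnnnnnnnnnnnnnnnnaGG GGCATGAACGATACCGATTAAGG
```

With the rows from the first message, only the last three columns (aGG) contribute, so the recomputed score is small compared to the alignment length of 23.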

So, for N-rich queries, the raw score should be low when compared to the alignment length. It is possible to collect these data for each query with the --userout output option and --userfields. Here is a toy-example:

vsearch \
    --usearch_global <(printf ">query1\nNNNNNNNNNNNNNNNNNNNNNGG\n") \
    --db <(printf ">target1\nGGCATGAACGATACCGATTAAGG\n") \
    --quiet \
    --minseqlength 23 \
    --id 1.0 \
    --userfields query+alnlen+ids+raw \
    --userout -

query1 23 23 4

Here the alignment length is 23, the number of matches is 23, and yet the raw score is only 4 (two unambiguous matches at +2 each), indicating an alignment with 21 ambiguous symbols.
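
Once the search is done, the collected fields can be screened with a small filter. The following is only a sketch: the function name flag_ambiguous_hits and the 0.8 cutoff are assumptions for illustration, not vsearch features. It expects tab-separated lines in the field order query+alnlen+ids+raw used in the toy example, and it assumes the default match score of +2, so that a perfect alignment without ambiguous symbols would score 2 × alnlen:

```shell
# Hypothetical post-search filter, not part of vsearch.
# Input: tab-separated lines with fields query, alnlen, ids, raw.
flag_ambiguous_hits() {
    # $1: minimum fraction of the maximum possible raw score (assumed cutoff)
    awk -F '\t' -v min_frac="${1:-0.8}" '
        $2 > 0 {
            frac = $4 / (2 * $2)    # 2 = default match score
            status = (frac < min_frac) ? "N-rich" : "ok"
            printf "%s\t%s\t%.2f\n", $1, status, frac
        }'
}

# toy input: query1 is the N-rich hit from above, query2 a clean 23-nt match
printf 'query1\t23\t23\t4\nquery2\t23\t23\t46\n' | flag_ambiguous_hits 0.8
```

Mismatches and gaps also lower the raw score, so this ratio separates N-rich hits cleanly only when the search is restricted to near-perfect matches (as with --id 1.0 here).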

As a side note, please keep in mind that vsearch implements a global-pairwise alignment method. If a primer sequence has several matching positions in a longer sequence (a genome), only one will be reported.

gbbio

Nov 23, 2023, 7:31:24 AM
to VSEARCH Forum
@Colin J Brislawn
Oh thanks! I was not aware of the manual.

@Frédéric Mahé
I am not sure about your side note: I can see multiple hits with --usearch_global for a genome/primer pair. (That is one of the reasons I chose vsearch for this.)



