Unrecognized string Error


Nayla Abney

Mar 2, 2021, 3:50:09 PM
to VSEARCH Forum
Hi guys, 

I am trying to cluster 5000 sequences into multiple groups. I have a single file containing all of the sequences. This is the command that I am using:

vsearch --cluster_fast seqdump_1.txt --clusters --id 1

The txt file is in FASTA format. However, when I run this code, I get the error:

Unrecognized string on command line (1)

These are the first few lines of the txt file:

>NP_001189784.1 cytochrome P450 3A4 isoform 2 [Homo sapiens]

MALIPDLAMETWLLLAVSLVLLYLYGTHSHGLFKKLGIPGPTPLPFLGNILSYHKGFCMFDMECHKKYGKVWGFYDGQQP

I would appreciate the help!

Thanks!


Colin Brislawn

Mar 2, 2021, 5:15:59 PM
to VSEARCH Forum
Hello Nayla,

It looks like that might be an amino acid sequence, and vsearch only works with nucleic acids. Unfortunately, you will have to use another program to cluster your proteins.

Colin

Nayla Abney

Mar 2, 2021, 5:43:44 PM
to VSEARCH Forum

Oh, okay. Thank you for the information. Do you know of any programs for clustering protein sequences?

Colin Brislawn

Mar 2, 2021, 6:18:58 PM
to VSEARCH Forum
I've used MMseqs2 and quite liked it!

Frédéric Mahé

Mar 7, 2021, 9:52:19 AM
to VSEARCH Forum


Hi Nayla,

The error message concerns your command line, not the content of your file. A filename or '-' is missing after --clusters. This would run:

printf ">s1\nAA\n>s2\nTT\n" | vsearch --cluster_fast - --clusters - --id 1

Now if you were to process amino-acid sequences, you would get a different warning:

printf ">s1\nMALIPD\n>s2\nMALIPD\n" | vsearch --cluster_fast - --clusters - --id 1

WARNING: 6 invalid characters stripped from FASTA file: I(2) L(2) P(2)
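
As a rough pre-check before running vsearch, the sketch below counts the characters in the sequence lines of a FASTA file that are not IUPAC nucleotide codes. The function name count_non_nucleotide is hypothetical, and the letter set is an assumption about what counts as a valid nucleotide symbol. It is a heuristic, not vsearch's own validation logic, and it will also count gap characters or digits if present.

```shell
# Hypothetical helper: count characters in FASTA sequence lines that are
# not IUPAC nucleotide letters (assumed set: ACGTU plus RYSWKMBDHVN).
# Reads FASTA from standard input and prints a single number.
count_non_nucleotide() {
    # 1. Drop header lines (those starting with '>').
    # 2. Delete every assumed-valid nucleotide letter and all whitespace.
    # 3. Count what is left; strip the padding some wc versions add.
    grep -v '^>' \
        | tr -d 'ACGTURYSWKMBDHVNacgturyswkmbdhvn \t\r\n' \
        | wc -c \
        | tr -d ' '
}

# Same toy inputs as in the examples above.
printf '>s1\nMALIPD\n>s2\nMALIPD\n' | count_non_nucleotide   # prints 6
printf '>s1\nAA\n>s2\nTT\n' | count_non_nucleotide           # prints 0
```

A non-zero count, as in the first call, suggests the file holds protein sequences (or other non-nucleotide text). To check a real file, something like count_non_nucleotide < seqdump_1.txt should work.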