Unrecognized string Error

Nayla Abney

Mar 2, 2021, 3:50:09 PM

to VSEARCH Forum

Hi guys,

I am trying to cluster 5000 sequences into multiple groups. I have a single file containing all of the sequences. This is the command that I am using:

vsearch --cluster_fast seqdump_1.txt --clusters --id 1

The txt file is in FASTA format. However, when I run this code, I get the error:

Unrecognized string on command line (1)

These are the first few lines of the txt file:

>NP_001189784.1 cytochrome P450 3A4 isoform 2 [Homo sapiens]

MALIPDLAMETWLLLAVSLVLLYLYGTHSHGLFKKLGIPGPTPLPFLGNILSYHKGFCMFDMECHKKYGKVWGFYDGQQP

I would appreciate the help!

Thanks!

Colin Brislawn

Mar 2, 2021, 5:15:59 PM

to VSEARCH Forum

Hello Nayla,

It looks like that might be an amino acid sequence, and vsearch only works with nucleic acids. Unfortunately, you will have to use another program to cluster your proteins.

Colin

Nayla Abney

Mar 2, 2021, 5:43:44 PM

to VSEARCH Forum

Oh, okay. Thank you for the information. Do you know of any programs for clustering protein sequences?

Colin Brislawn

Mar 2, 2021, 6:18:58 PM

to VSEARCH Forum

I've used MMSeqs2 and quite liked it!

https://github.com/soedinglab/MMseqs2

Colin

Frédéric Mahé

Mar 7, 2021, 9:52:19 AM

to VSEARCH Forum

Hi Nayla,

The error message concerns your command line: a filename or '-' is missing after --clusters. This would run:

printf ">s1\nAA\n>s2\nTT\n" | vsearch --cluster_fast - --clusters - --id 1

Now if you were to process amino-acid sequences, you would get a different warning:

printf ">s1\nMALIPD\n>s2\nMALIPD\n" | vsearch --cluster_fast - --clusters - --id 1

WARNING: 6 invalid characters stripped from FASTA file: I(2) L(2) P(2)
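
To tell in advance whether a FASTA file holds something other than nucleotide sequences, a quick pre-check along these lines might help. This is only a sketch: the function name count_non_nucleotide is made up for illustration, and the accepted alphabet (the IUPAC nucleotide codes, upper and lower case) is an assumption, not something taken from the vsearch source.

```shell
# Hypothetical helper (not part of vsearch): count the characters in the
# sequence lines of a FASTA file that are not IUPAC nucleotide codes.
# Assumption: valid symbols are ACGTU plus the ambiguity codes RYSWKMBDHVN.
count_non_nucleotide() {
    # 1. drop the header lines (those starting with '>')
    # 2. delete every accepted nucleotide symbol and all line endings
    # 3. count whatever characters are left
    grep -v '^>' "$1" \
        | tr -d 'ACGTUacgtuRYSWKMBDHVNryswkmbdhvn\n\r' \
        | wc -c \
        | tr -d ' '
}
```

Called as count_non_nucleotide seqdump_1.txt, it prints a single number. For the two-sequence MALIPD example above it prints 6 (two each of L, I and P), and for the AA/TT example it prints 0. Gap or stop symbols such as '-' and '*' are counted too, so a non-zero result is a hint to look at the file rather than proof that it contains proteins.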
