Format of accession number in header?


lanthala

Aug 26, 2016, 9:04:41 PM
to CLARK Users
I couldn't find any details in the manual: exactly what format do the accession numbers need to be in for CLARK to look up taxids from them? My FASTA headers contain GCA_###### after the strain name, but ./set_targets.sh is unable to identify them.

Rachid

Aug 28, 2016, 10:22:59 PM
to CLARK Users
Hello Lanthala,

It is explained in the README file under "CLASSIFICATION OF METAGENOMIC SAMPLES", section "1.1) Step I: Setting targets".

The header format is also described on this NCBI webpage (http://www.ncbi.nlm.nih.gov/news/03-02-2016-phase-out-of-GI-numbers/), which covers the change from GI numbers to accession numbers. The header of each FASTA file is expected to be in one of two formats, either:

>gi|XXXXXX|gb|ACCESSION.VERSION| Name..
...

or

>ACCESSION.VERSION
...

The script set_targets.sh handles both cases automatically. It seems you want to use CLARK with customized/specific sequences? If so, please make sure the headers of these FASTA files follow one of the two formats described above.
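
A quick way to check headers before running set_targets.sh is a small script along these lines. It is only a sketch: the regular expressions are one reading of the two formats above, not CLARK's own parser, and classify_header and check_fasta are hypothetical helper names.

```python
import re

# Assumed patterns for the two header styles described above; CLARK's own
# parsing may be stricter or looser than these.
GI_STYLE = re.compile(r"^>gi\|\d+\|[a-z]+\|([A-Za-z0-9_]+\.\d+)\|")
ACC_STYLE = re.compile(r"^>([A-Za-z0-9_]+\.\d+)(\s|$)")


def classify_header(line):
    """Return (style, accession.version), or (None, None) if unrecognized."""
    m = GI_STYLE.match(line)
    if m:
        return "gi", m.group(1)
    m = ACC_STYLE.match(line)
    if m:
        return "accession", m.group(1)
    return None, None


def check_fasta(path):
    """Yield (line_number, header) for headers that match neither style."""
    with open(path) as handle:
        for number, line in enumerate(handle, start=1):
            if line.startswith(">") and classify_header(line)[0] is None:
                yield number, line.rstrip("\n")
```

Under these patterns, a header such as ">StrainX GCA_000001" would be reported by check_fasta, because the header does not begin with an ACCESSION.VERSION token.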

Best,
Rachid



lanthala

Aug 31, 2016, 8:29:48 PM
to CLARK Users
Thank you for the information. I couldn't add the correct information to the headers (my genomes don't have NCBI-style accessions), so instead I manually edited the targets.txt file to contain the taxid of each genome (basically by adding the taxid to the targets_excluded.txt file). This is what I used to do in a previous version of CLARK. Unfortunately, CLARK no longer seems to handle this correctly: when I subsequently try to classify against that database, it reports finding 0 k-mers, and the edited targets.txt file is blank once again.

Do you have any suggestions as to how to let CLARK use the taxid information I have?

Rachid OUNIT

Aug 31, 2016, 9:07:03 PM
to lanthala, CLARK Users
Hello Lanthala,

Yes, there is a way, and it seems you have already built the targets definition. If so, you do not need to run the script "set_targets.sh" (it would erase your definition). You can directly call the classifier executables (CLARK, CLARK-l and CLARK-S) located in the folder "exe".
So the command line (assuming your targets definition is in "mytargets.txt") is something like:
$ ./exe/CLARK -T mytargets.txt -D <DatabaseDirectory> -k 31 -O file.fa -R res ...
 
So it seems in your case you can run the CLARK classifiers without the need to use "set_targets.sh".
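
If the targets definition ever needs to be regenerated by hand, something like the following could write it. This is a sketch under an assumption that should be checked against the README for your CLARK version: that each line of the targets definition holds a FASTA file path and its label (here the taxid), separated by a space. The function name write_targets, the paths, and the taxids are placeholders for illustration only.

```python
def write_targets(pairs, out_path):
    """Write one "<fasta path> <label>" line per genome.

    The two-column, space-separated layout is an assumption about the
    targets definition format; check it against the README first.
    """
    with open(out_path, "w") as out:
        for fasta_path, taxid in pairs:
            # A space inside the path would make the line ambiguous.
            if " " in fasta_path:
                raise ValueError(f"path contains a space: {fasta_path!r}")
            out.write(f"{fasta_path} {taxid}\n")


if __name__ == "__main__":
    # Hypothetical genomes and taxids, for illustration only.
    write_targets(
        [("/data/genomes/strainA.fa", 562), ("/data/genomes/strainB.fa", 1280)],
        "mytargets.txt",
    )
```

The resulting file would then be passed to the classifier with -T, as in the command above.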

Best,
Rachid
