Taxonomy data missing

UIMOL3

Mar 23, 2016, 7:28:14 PM

to CLARK Users

I am writing because when I run ./set_targets.sh with my custom database (./set_targets.sh DIR_DB/Custom/ DIR_DB/virus --species), the software shows the message "taxonomy data or files missing" and immediately starts downloading an archive called "gi_taxid_nucl.dmp.gz" (1.3 GB). When that finishes, it downloads another one called "taxdump.tar.gz". Once both downloads are done, the software shows "Failed to recognize the database" and prints the instructions for running an analysis against viruses, human, bacteria or Custom.

I don't know where my mistake is, because I followed the same steps as the manual: I created the DIR_DB directory in the directory where the software is located, and then the Custom subdirectory, which holds my custom sequences in .fna format from databases such as mammals, protozoa, vertebrate, etc. (the databases are complete genome assemblies).

So I am sending this message to ask for help, because I don't know what to do.

Thanks

Greetings.

Juan C

Rachid Ounit

Mar 26, 2016, 4:24:44 PM

to UIMOL3, CLARK Users

Hello Juan,

Thank you for reporting this issue. Can you please send me the complete log printed by the program? In addition, it would help if you could describe in more detail the sequences you used in your custom database. Where do they come from? Do they have a GI number in the header, in the RefSeq format? If they do, can you share them with us so we can reproduce this issue on our side?

Cheers,
Rachid

UIMOL3

Mar 27, 2016, 5:13:32 AM

to CLARK Users

Dear Rachid,

Hi, thanks for the answer. My custom data come from the genomes/refseq directory of the NCBI FTP site (fungi, plants, mammals, vertebrate, invertebrate and protozoa), and I downloaded them using the Linux commands recommended in the NCBI manual (don't worry, I will share the PDF). After reading the CLARK manual a bit more, I wonder: does writing "Custom" instead of "custom" after DIR_DB affect how the command works?

Other questions: I tried to download the human database, but the program cannot download it. Where is the problem? My other question is about the second step in CLARK-S, where you mention:

Step 1: If the discriminative 31-mers of the database you have defined in step 0 does not exist, then run:

$ ./classify_metagenome.sh -O sample.fa -R result
where sample.fa is some fasta file data (this command line matters primarily to create the discriminative 31-mers).

Is the -O file the same as the one used in:

Step "3".- To classify your metagenome (for example, A1.10.1000.fq) using CLARK-S, you need to select the spaced mode, thanks to the option "-m 4":
$ ./classify_metagenome.sh -O A1.10.1000.fq -R result.A1 -m 4

Using the example you give, would the command lines look like this?

$ ./set_targets.sh DIR_DB bacteria viruses human --genus
$ ./classify_metagenome.sh -O A1.10.1000.fq -R result.A1
$ ./buildSpacedDB.sh
$ ./classify_metagenome.sh -O A1.10.1000.fq -R result.A1 -m 4

Thanks a lot for the help. I look forward to your answers.

Greetings

Juan C.

HowTo_Downloading_Genomic_Data.pdf

Diapositiva2.JPG

Diapositiva3.JPG

Diapositiva4.JPG

Diapositiva5.JPG

UIMOL3

Mar 28, 2016, 7:44:47 PM

to CLARK Users

Hi again. I reinstalled the software, copied my custom sequence dataset (611 genomes from the NCBI FTP site) into the custom subdirectory of the DIR_DB directory, and ran the software again following the same instructions from the manual, but the analysis failed again. So I am uploading a PPT with screenshots of every step I took. I hope you can help me find where the issue is.

My goal is to find out what kind of contamination a sample of ancient DNA has, apart from bacterial, viral and human contamination. That is why I downloaded all the available genomes, but it seems the software either cannot find the sequences I downloaded or cannot build the taxonomy mapping files for them.

Thanks a lot. I look forward to your answer.

Juan C

P.S. If I downloaded the wrong sequences, where can I download full datasets for other kingdoms (as the software does for bacteria, viruses and human)?

CLARK pipeline.pptx

Rachid

Mar 30, 2016, 12:56:14 AM

to CLARK Users

Hello Juan,

Thank you for the detailed response! As I mentioned, the sequences in the custom folder ("Custom/") should have the GI number in their headers.

You can look at the sequences of bacteria that you downloaded to get an example.

For instance, ">gi|158333233|ref|NC_009925.1|". It seems your sequences do not have it.
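
One quick way to check this yourself is to count how many FASTA headers begin with a "gi|<number>|" field. The sketch below is only an illustration: the function name count_gi_headers is hypothetical (it is not part of CLARK), and it assumes the header layout shown in the example above.

```shell
# Hypothetical helper (not part of CLARK): report how many headers in a
# FASTA file start with a "gi|<number>|" field, as in the RefSeq example.
count_gi_headers() {
    # $1: path to a FASTA file
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
    total=$(grep -c '^>' "$1" || true)
    with_gi=$(grep -c '^>gi|[0-9][0-9]*|' "$1" || true)
    echo "$1: $with_gi of $total headers start with a GI number"
}

# Usage sketch (the folder path is an assumption; adjust it to your setup):
# for f in DIR_DB/Custom/*.fna; do count_gi_headers "$f"; done
```

If a file reports fewer matching headers than total headers, its sequences lack the GI field described above.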

If your sequences do not have a GI number, then I would recommend working directly with the CLARK executable in the folder "exe/" (so without the two scripts, "set_targets.sh" and "classify_metagenome.sh").

See the section "LOW-LEVEL DESCRIPTION & EXAMPLES" in the README file.

What you would need to do is create the two-column file called "targets.txt". That's what set_targets.sh does for you when you work with bacteria/viruses sequences from RefSeq.

This targets.txt file, as described in the README file, defines the targets CLARK uses for the classification:

The first column contains the addresses (file paths) of your sequences (bacteria, viruses, eukaryotes, etc.) and the second column contains the corresponding taxonomy ID at the species level.
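
As a rough illustration of the two-column layout, the sketch below builds such a file from a mapping you prepare by hand. Everything in it is an assumption made for the example: the function name make_targets, the mapping file with "<file name> <taxonomy ID>" on each line, and the single space used between columns. Please check the README for the exact format CLARK expects.

```shell
# Hypothetical helper (not part of CLARK): write a two-column targets file
# from a hand-made mapping of "<file name> <species-level taxonomy ID>".
make_targets() {
    # $1: directory holding the sequence files
    # $2: mapping file (assumed format: file name, whitespace, taxonomy ID)
    # $3: output file, e.g. targets.txt
    : > "$3"    # start with an empty output file
    while read -r name taxid || [ -n "$name" ]; do
        # Skip blank lines.
        [ -n "$name" ] || continue
        if [ -f "$1/$name" ]; then
            # Column 1: path of the sequence file; column 2: taxonomy ID.
            printf '%s %s\n' "$1/$name" "$taxid" >> "$3"
        else
            echo "warning: $1/$name not found, skipped" >&2
        fi
    done < "$2"
}

# Usage sketch (all paths are assumptions):
# make_targets /path/to/DIR_DB/Custom my_taxids.txt targets.txt
```

Files listed in the mapping but missing on disk are skipped with a warning, so the output only refers to sequences that actually exist.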