Building a SILVA custom database

Joany

Jan 10, 2016, 1:46:42 AM

to CLARK Users

Hello CLARK Users:

I am currently trying to build a custom database from SILVA_123_SSURef (http://www.arb-silva.de/no_cache/download/archive/release_123/Exports/) FASTA sequences. I believe this is my only option given my data: I have whole genomic DNA extracted from stromatolites (an environmental sample), and according to the manual:

If you analyze a metagenomic sample from a poorly known microbial habitat (i.e., the RefSeq database does not contain genomes of organisms present in your sample), for example, sea water, ocean, etc., then use the spaced mode ("-m 4").

In order for me to run the spaced mode, I am attempting to run the ./set_targets.sh command with a custom database, yet I received the following output:

If you want CLARK to use a customized database then please do the following directions:
1) Move your sequences (each fasta file defined with a GI number) to /home/someFolder

The issue is that SILVA does not contain GI numbers because they have reconstructed the NCBI taxonomy and their FASTA header looks like this (as an example):

>AB001038.1.1721 Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta;Chlorophyceae;Chlamydomonadales;Polytoma;Chlamydomonas pulsatilla

Is there a way around this? For example, is there a way to add the GI numbers back into the headers so that CLARK-S is compatible with the SILVA (or even Greengenes) database?

Thank you very much!
Joany

Rachid

Jan 12, 2016, 3:51:56 PM

to CLARK Users

Dear Joany,

You are right: the provided scripts for setting the database (set_targets.sh) and classifying samples (classify_metagenome.sh) are specific to the NCBI/RefSeq databases. In your case, you want to use a different database.

You can still use CLARK, however, if you bypass these scripts and directly call the executable "CLARK" located in the "exe" folder (created at installation).

To use CLARK with data from the SILVA database, you would need to follow these steps (I am assuming you want to work at the species level):

1) Download all the files/sequences into a specific directory.

2) Extract every sequence from each downloaded file and store each sequence in its own file in a dedicated directory, called "DIR_DB".

3) Build a two-column file "targets.txt": the first column contains the filenames of the sequences stored in DIR_DB, and the second column contains the identifier associated with each sequence (or its scientific name, with the words joined by '-' or '_', for example). For example, from the terminal:

$ cat targets.txt

<DIR_DB/SEQUENCE1> <ID_1>

<DIR_DB/SEQUENCE2> <ID_2>

<DIR_DB/SEQUENCE3> <ID_3>

...

Note that <ID_1> and <ID_2> can be the same identifier if <SEQUENCE1> and <SEQUENCE2> have the same identifier at the species level.

I believe that in order to do this, you need to use the SILVA taxonomy definitions.

These steps are actually the main steps done by the script "set_targets.sh" but for NCBI/RefSeq sequences.
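
Steps 2) and 3) could be scripted roughly as follows. This is only a sketch and not part of CLARK: the function name split_silva and the ".fa" file suffix are made up for illustration, and it assumes SILVA-style headers like the one quoted earlier in this thread, where the last ';'-separated field is the species-level name. Words in the name are joined with '_'.

```shell
#!/usr/bin/env bash
# Sketch: split a SILVA FASTA export into one file per sequence and write
# a two-column targets file (sequence path, species label).
# split_silva is a hypothetical helper, not a CLARK script.
split_silva() {
    local fasta="$1" dir_db="$2" targets="$3"
    mkdir -p "$dir_db"
    awk -v dir="$dir_db" -v targets="$targets" '
        /^>/ {
            if (out != "") close(out)        # keep the number of open files low
            acc = substr($1, 2)              # e.g. AB001038.1.1721
            line = $0
            sub(/^[^ ]+ /, "", line)         # drop the identifier, keep the taxonomy path
            if (line ~ /^>/) line = acc      # no taxonomy in the header: fall back to the ID
            n = split(line, ranks, ";")
            label = ranks[n]                 # assumed: last field is the species-level name
            gsub(/[ \t]+/, "_", label)       # join words so the label stays one column
            out = dir "/" acc ".fa"
            print out, label > targets
        }
        out != "" { print > out }            # copy header and sequence lines to the per-sequence file
    ' "$fasta"
}
```

It would be called as split_silva <silva.fasta> DIR_DB targets.txt. Sequences that share a species name end up with the same label in the second column, which matches the note above about identical identifiers. SILVA exports may store sequences as RNA (U instead of T); if that is the case for your download, you may need to convert them first.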

Then run CLARK with default settings to classify, say, the sample <sampleA.fa> and store the results in <resultsA>:

$ ./exe/CLARK -T targets.txt -D DIR_DB -O <sampleA.fa> -R <resultsA>

It will build the database first if it has not been created yet.

Cheers,

Rachid

Vishal Koparde

Dec 16, 2016, 3:39:31 PM

to CLARK Users

How do we estimate abundance after that?

-Vishal

Rachid OUNIT

Dec 16, 2016, 8:18:17 PM

to Vishal Koparde, CLARK Users

Hello Vishal,

Once you have the results files created by CLARK with the SILVA database, you can use the script "estimate_abundance.sh", as you would for any other results file created using the default database.

Cheers,

Rachid

RP

Jan 12, 2017, 3:58:40 PM

to CLARK Users

Hey Joany,

You could also try grepping the GI out of NCBI's taxonomy list.

Download the list here (it's a TSV: accession<\t>accession.version<\t>tax ID<\t>GI):

wget "ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz"

Then decompress it (gunzip nucl_gb.accession2taxid.gz) and grep for the accession.version:

grep -P "\tAB001038\.1\t" nucl_gb.accession2taxid

Then use sed or something to replace headers in your SILVA reference fastas.
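
The header replacement could look roughly like the sketch below, using awk rather than sed. Several things here are assumptions: the function name add_gi_to_headers is made up, the old NCBI-style prefix ">gi|<GI>|gb|<accession.version>|" is only a guess at what the CLARK scripts expect, and the lookup uses the bare accession (column 1 of the TSV) because SILVA identifiers appear to have the form accession.start.stop rather than accession.version. The FASTA file is read twice so that only the accessions actually present are kept in memory while the large TSV is streamed.

```shell
#!/usr/bin/env bash
# Sketch: prepend a GI-based prefix to SILVA FASTA headers using the
# NCBI accession2taxid table. add_gi_to_headers is a hypothetical helper.
# Usage: add_gi_to_headers <accession2taxid.tsv> <silva.fasta> > <out.fasta>
add_gi_to_headers() {
    local map="$1" fasta="$2"
    awk -F'\t' '
        FNR == 1 { pass++ }                  # 1: FASTA, 2: TSV, 3: FASTA again
        pass == 1 {
            if (/^>/) {
                split(substr($0, 2), p, /[. ]/)
                want[p[1]] = 1               # bare accession, e.g. AB001038
            }
            next
        }
        pass == 2 {
            if ($1 in want) { gi[$1] = $4; ver[$1] = $2 }
            next
        }
        /^>/ {
            split(substr($0, 2), p, /[. ]/)
            acc = p[1]
            if (acc in gi) {
                # Assumed header layout; the original header is kept after the prefix.
                print ">gi|" gi[acc] "|gb|" ver[acc] "| " substr($0, 2)
            } else {
                print                        # no match: leave the header unchanged
                missing++
            }
            next
        }
        { print }
        END {
            if (missing) printf "%d headers had no GI match\n", missing > "/dev/stderr"
        }
    ' "$fasta" "$map" "$fasta"
}
```

Headers without a match are passed through unchanged and counted on stderr. NCBI has been phasing out GI numbers, so newer records may not have a usable GI in this table.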

Borja Rojas

Aug 6, 2018, 7:15:04 PM

to CLARK Users

Hi Joany,

Could you share how you resolved this? I am trying to carry out a similar analysis, but I am stuck on the database creation. Even a link to download a 16S database for CLARK would help.

Any help would be appreciated.

Thank you for your consideration,

Borja
