Custom database creation

Shareef Dabdoub

Mar 7, 2017, 2:10:54 PM

to CLARK Users

I've been preparing a custom database for use with CLARK-S and I had a few questions after going through the README.

1.) How are accession numbers in the FASTA data used? The majority of the genomes I'm using have GenBank accession IDs, but a few only have JGI GOLD IDs or no standard database ID yet. In those cases, will it be fine to use any unique identifier?

2.) For targets_addresses.txt the README recommends NCBI taxonomy IDs as the labels for each genome file. It mentions that any label is fine, but I'm wondering if there will be an issue mixing NCBI IDs and, for example, "Genus_species" labels within the same targets_addresses file.

3.) For setting up a custom database, the README specifies "...one fasta file per reference sequence". The database I'm working with provides a single file containing all the FASTA records for all the genomes, most of which are split into multiple contigs. I'm working on processing the file for use with CLARK, and I just wanted to verify whether CLARK will expect one file per contig, or if I could group multiple contigs for one genome in the same file.

Thanks,

Shareef

Rachid

Mar 8, 2017, 7:40:51 PM

to CLARK Users

Hello Shareef,

Please see my answers below. I hope it helps!

Best,

Rachid

On Tuesday, March 7, 2017 at 2:10:54 PM UTC-5, Shareef Dabdoub wrote:

I've been preparing a custom database for use with CLARK-S and I had a few questions after going through the README.

1.) How are accession numbers in the FASTA data used? The majority of the genomes I'm using have GenBank accession IDs, but a few only have JGI GOLD IDs or no standard database ID yet. In those cases, will it be fine to use any unique identifier?

Accession numbers are used to taxonomically identify reference sequences (using the NCBI taxonomy tree). The script "set_targets.sh" does it automatically for you when you work with the provided databases.

If you want to work with a custom database whose sequences do not have accession numbers or do not follow the NCBI/RefSeq header format, then you can follow one of these options:

1) I suggest you redefine the header of each fasta file in a manner consistent with the NCBI/RefSeq sequences (cf. README file). For example:

">accession.number ..." or ">gi|number|ref|accession.number| ..."

If you know the species name or species taxonomy ID of your JGI sequences, can you look up the NCBI accession number? Once you have the accession number, you can overwrite the header of these sequences with it, using the format mentioned.

Once you have reformatted these files, copy/move them into the custom directory and run "set_targets.sh" (cf. https://groups.google.com/d/msg/clarkusers/6zOIFt_0elk/0paAbAiaDwAJ).

I recommend you take this option.

2) You can build the targets definition directly (the file "targets.txt" in your database directory). This is a two-column file giving, for each reference sequence, its filename (the file's address on your disk) and the species ID (a number, a name, etc.). See the README file or https://groups.google.com/d/msg/clarkusers/QoL2I1q0Udc/nBbnfjjADQAJ for more about it.
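As an illustration of option 1, here is a minimal Python sketch that rewrites FASTA headers into the ">accession.number ..." form, keeping the original header text as the description. The function name `rewrite_headers` and the accession "NC_000913.3" are only examples chosen for this sketch; they are not part of CLARK.

```python
def rewrite_headers(fasta_text: str, accession: str) -> str:
    """Put `accession` at the front of every FASTA header line.

    The original header text is kept after the accession so that the
    source identifier (for example a JGI GOLD ID) is not lost.
    """
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            old_header = line[1:].strip()
            out.append(f">{accession} {old_header}")
        else:
            out.append(line)  # sequence lines are copied unchanged
    return "\n".join(out) + "\n"


# Example with a made-up JGI-style header.
print(rewrite_headers(">Ga0001 contig1\nACGT\n", "NC_000913.3"), end="")
```

The sketch works on text held in memory; for large genomes it may be preferable to stream the file line by line instead.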

2.) For targets_addresses.txt the README recommends NCBI taxonomy IDs as the labels for each genome file. It mentions that any label is fine, but I'm wondering if there will be an issue mixing NCBI IDs and, for example, "Genus_species" labels within the same targets_addresses file.

When you define your targets, you define them for a given taxonomic rank (species by default if you can), so one identifier (a number, a text, etc.) is sufficient.

But what do you mean by "mixing NCBI IDs"? Can you provide a real case for such a situation?

3.) For setting up a custom database, the README specifies "...one fasta file per reference sequence". The database I'm working with provides a single file containing all the FASTA records for all the genomes, most of which are split into multiple contigs. I'm working on processing the file for use with CLARK, and I just wanted to verify whether CLARK will expect one file per contig, or if I could group multiple contigs for one genome in the same file.

Multiple contigs per fasta file are fine as long as they belong to the same taxon (defined for a given taxonomic rank).
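For a database shipped as a single multi-genome FASTA file, the grouping could be sketched in Python as below. This assumes a hypothetical header layout of ">GENOMEID|contig..." and a hypothetical helper `genome_before_pipe`; adapt the helper to whatever your headers actually look like.

```python
from collections import defaultdict
from typing import Callable, Dict


def group_contigs(fasta_text: str, genome_of: Callable[[str], str]) -> Dict[str, str]:
    """Group FASTA records by genome, returning genome ID -> FASTA text.

    `genome_of` receives a header (without the leading '>') and returns
    the genome it belongs to.
    """
    groups = defaultdict(list)
    current = None
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            current = genome_of(line[1:])
        if current is None:
            continue  # ignore anything before the first header
        groups[current].append(line)
    return {genome: "\n".join(lines) + "\n" for genome, lines in groups.items()}


def genome_before_pipe(header: str) -> str:
    """Hypothetical rule: the genome ID is the text before the first '|'."""
    return header.split("|", 1)[0]


if __name__ == "__main__":
    combined = ">g1|c1\nAC\n>g2|c1\nGG\n>g1|c2\nTT\n"
    for genome, text in group_contigs(combined, genome_before_pipe).items():
        # One output file per genome, holding all of its contigs.
        with open(f"{genome}.fa", "w") as handle:
            handle.write(text)
```

Each resulting file then holds every contig of exactly one genome, which matches the one-taxon-per-file condition described above.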


Shareef Dabdoub

Mar 24, 2017, 3:22:47 PM

to CLARK Users

Hi Rachid,

Thank you very much for your answers; they were very helpful. Some follow-up and responses to your questions:

If you know the species name or species taxonomy ID of your JGI sequences, can you look up the NCBI accession number? Once you have the accession number, you can overwrite the header of these sequences with it, using the format mentioned.

Some of the genomes I mentioned have NCBI BioProject numbers, but no accession numbers or taxonomy IDs.

When you define your targets, you define them for a given taxonomic rank (species by default if you can), so one identifier (a number, a text, etc.) is sufficient.
But what do you mean by "mixing NCBI IDs"? Can you provide a real case for such a situation?

I was considering the situation described above. Most of the genomes have NCBI identifiers and would thus be fine to run through set_targets.sh. But if I constructed the targets file myself with the appropriate NCBI IDs for each genome file, could I still include the genomes with no NCBI IDs by manually specifying their taxonomic strings in the targets file? So for example:

./genomes/genome1.fa  NCBI_ID_1
./genomes/genome2.fa  NCBI_ID_2
./genomes/genome3.fa  Genus_species
./genomes/genome4.fa  NCBI_ID_3

Which is what I meant by mixing NCBI IDs and alternate identifiers.

Thanks,

Shareef

Shareef Dabdoub

Mar 24, 2017, 3:24:58 PM

to CLARK Users

But even if I constructed the targets file myself...

Or alternatively, if I let set_targets.sh make the targets file, could I then supplement it with the genomes that do not have NCBI IDs?

Rachid OUNIT

Mar 29, 2017, 5:43:11 PM

to Shareef Dabdoub, CLARK Users

See my answers below,

Cheers,

I strongly recommend against it, as it is important to keep a consistent target naming/labeling system (please see the CLARK publications). By doing so, you may mix genomes from the NCBI database that are identical or similar, and the program will systematically consider them as different, leading to a flawed database (i.e., a "multiple naming" issue)...

Here is a serious problem that can occur if you follow this idea. You may introduce a genome that is a strain of E. coli, whether or not it is in the NCBI database, and give it a distinct label rather than "562" as it should have (so that the system recognizes it as a strain belonging to the species E. coli, along with all the other known E. coli strains in the NCBI database). The program will then not know that the genome you are introducing should be associated with the known genomes of E. coli. Because of this "multiple naming" issue, during database construction every k-mer appearing in two or more distinct targets is removed, so any k-mers shared between the introduced strain and the known E. coli genomes would be "killed", which would lead to an abnormally low number of k-mers specific to E. coli...
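The effect can be illustrated with a toy Python model. This is only a simplified sketch of the rule described above (k-mers found in two or more targets are dropped), not CLARK's actual implementation; the sequences, the labels "ALT" and "other_species", and the function names are invented for the example.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def discriminative_kmers(genomes, k):
    """genomes: list of (label, sequence) pairs.

    Returns label -> k-mers that occur under that label only.
    """
    per_label = defaultdict(set)
    for label, seq in genomes:
        per_label[label] |= kmers(seq, k)
    owners = defaultdict(set)  # k-mer -> labels that contain it
    for label, ks in per_label.items():
        for km in ks:
            owners[km].add(label)
    return {
        label: {km for km in ks if len(owners[km]) == 1}
        for label, ks in per_label.items()
    }


strain_a, strain_b, other = "ACGTAC", "ACGTAG", "TTTTGG"

# Both strains carry the same species label: shared k-mers survive.
consistent = discriminative_kmers(
    [("562", strain_a), ("562", strain_b), ("other_species", other)], k=4)

# The second strain gets its own label: shared k-mers are removed from both.
mixed = discriminative_kmers(
    [("562", strain_a), ("ALT", strain_b), ("other_species", other)], k=4)

print(len(consistent["562"]), len(mixed["562"]))  # prints: 4 1
```

With consistent labels the species keeps four k-mers; with the extra label it keeps only one, because the k-mers the two strains share now count as appearing in two distinct targets.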

However, if you do have perfect knowledge of all genomes in your custom database and assign identifiers that are consistently identical per taxon (regardless of where the genomes come from), at the species level for example, then you can proceed as you described. You can also alter the targets definition file once it is created (i.e., the "targets.txt" file created by set_targets.sh), but again you need to be absolutely sure that the identifiers you add won't create the "multiple naming" problem and won't confuse you later on, especially if you use numbers (so my suggestion is to use identifiers with a constant prefix such as "ALTID", like "ALTIDXXXX").
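A small Python sanity check along these lines might look as follows. It assumes the targets file has two whitespace-separated columns (path, label) with no spaces inside paths, and it treats a label as expected if it is purely numeric or starts with the "ALTID" prefix; the function names are made up for this sketch. It cannot detect that two different labels refer to the same species, which still requires knowing your genomes.

```python
def parse_targets(text):
    """Parse a two-column targets file into a list of (path, label) pairs."""
    entries = []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if len(parts) != 2:
            raise ValueError(f"expected 2 columns, got: {line!r}")
        entries.append((parts[0], parts[1]))
    return entries


def unexpected_labels(entries, prefix="ALTID"):
    """Return entries whose label is neither numeric nor prefixed as agreed."""
    return [
        (path, label)
        for path, label in entries
        if not (label.isdigit() or label.startswith(prefix))
    ]


targets_text = """\
./genomes/genome1.fa  562
./genomes/genome3.fa  Genus_species
./genomes/genome4.fa  ALTID0001
"""

for path, label in unexpected_labels(parse_targets(targets_text)):
    print(f"check label {label!r} for {path}")
```

Here only the "Genus_species" line is reported, since it follows neither naming scheme.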

Hope this helps,

Best,

Rachid


--
You received this message because you are subscribed to the Google Groups "CLARK Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to clarkusers+unsubscribe@googlegroups.com.
To post to this group, send email to clark...@googlegroups.com.
Visit this group at https://groups.google.com/group/clarkusers.
To view this discussion on the web visit https://groups.google.com/d/msgid/clarkusers/37daf4eb-b16f-45d9-90f0-23d87ed15322%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Shareef Dabdoub

Mar 30, 2017, 2:28:20 PM

to CLARK Users, sha...@dabdoub.net

Ah, ok, sure that makes sense. Thanks.

One follow-up question regarding multiple contigs in the same file: would it be a problem if each contig had its own unique GenBank accession ID?

Shareef


Rachid

Mar 30, 2017, 10:59:08 PM

to CLARK Users, sha...@dabdoub.net

Yes, it would be a problem because it contradicts the assumption that each file in the database contains sequences for only one taxon (cf. README file).
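One way to check the one-taxon-per-file assumption before building the database is sketched below in Python. It assumes headers of the form ">accession.number ..." and a user-supplied dictionary `taxon_of_accession` mapping each accession to its taxon label; both the dictionary and the function name are hypothetical and would have to be filled in from your own records.

```python
def taxa_in_fasta(fasta_text, taxon_of_accession):
    """Return the set of taxon labels referenced by the headers of one file.

    Raises KeyError if a header's accession is missing from the mapping.
    """
    taxa = set()
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue
        fields = line[1:].split()
        if not fields:
            raise ValueError("empty FASTA header")
        taxa.add(taxon_of_accession[fields[0]])  # first field is the accession
    return taxa


# Made-up accessions and labels, for illustration only.
mapping = {"A.1": "562", "B.1": "562", "C.1": "ALTID0001"}

same_taxon = ">A.1 contig1\nAC\n>B.1 contig2\nGG\n"
two_taxa = ">A.1 contig1\nAC\n>C.1 contig1\nTT\n"

print(len(taxa_in_fasta(same_taxon, mapping)))  # prints: 1
print(len(taxa_in_fasta(two_taxa, mapping)))    # prints: 2
```

A file for which this returns more than one label mixes taxa and would violate the assumption stated above.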
