Creating 16S database really slow

Jussi Volanen

Jul 9, 2018, 10:59:34 AM

to CLARK Users

Hi Rachid,

I'm trying to create a 16S database from RDP. I downloaded the GenBank-formatted data and created FASTA files using the following Python script:

from Bio import SeqIO
import sys

# Read GenBank records from standard input
input_handle = sys.stdin

# Write each record to its own FASTA file, named after the record ID
for seq_record in SeqIO.parse(input_handle, "genbank"):
    with open(seq_record.id + ".fasta", "w") as fasta:
        fasta.write(">%s\n%s\n" % (seq_record.id, seq_record.seq))

Then I executed set_targets:

./set_targets.sh /db custom

Loading accession number of all files... done (1421655)

Loading merged Tax ID... done

Retrieving taxonomy ID for each file... done (1421387 files were successfully mapped, and 268 unidentified).

custom: Retrieving taxonomy nodes for each sequence based on taxon ID...

Loading nodes of taxonomy tree... done.

Retrieving lineage for each sequence...

and then nothing. I can see that /opt/clark/CLARKSCV1.2.5/exe/getfilesToTaxNodes consumes 100% CPU. How long should I wait? Does it take more than a day?

Best regards

Jussi Volanen

Rachid

Jul 9, 2018, 11:46:45 AM

to CLARK Users

Hi Jussi,

I have worked with the RDP database -- 16S for bacteria and archaea. I noticed several problems in this database: for example, some sequences in the bacteria or archaea database are actually eukaryotes (so there are some inconsistencies in IDs with GenBank).

Also, when I ran my analysis, I found that a few sequences have taxonomy IDs with a circular definition in their lineage, which makes the program getfilesToTaxNodes fall into an infinite loop. I was able to find them in the file "<DIR_DB>/.<DB_NAME>.fileToTaxIDs", which is populated on the fly, by checking the last sequence successfully processed to deduce where the program got stuck.

I would suggest stopping the program and identifying the sequences in the file "/db/.custom.fileToTaxIDs" that prevent the program from moving forward. There should be a few of them.
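
For readers who want to look for such circular lineages directly, here is a minimal Python sketch. It is not part of CLARK and makes no claim about how getfilesToTaxNodes works internally. It assumes only the standard NCBI nodes.dmp layout (fields separated by tab, pipe, tab, with the taxon ID first and the parent taxon ID second); the function names load_parents and find_cycle are made up for illustration.

```python
def load_parents(lines):
    """Build a {taxid: parent_taxid} map from the lines of an NCBI nodes.dmp file."""
    parent = {}
    for line in lines:
        fields = line.split("\t|\t")
        parent[int(fields[0])] = int(fields[1])
    return parent


def find_cycle(taxid, parent):
    """Return the taxids forming a loop in the lineage of taxid, or None if there is none."""
    seen = {}   # taxid -> position in path
    path = []
    node = taxid
    while node in parent:
        if parent[node] == node:
            # A node that is its own parent ends the walk; in nodes.dmp the
            # root (taxid 1) is listed this way.
            return None
        if node in seen:
            # We came back to a node already visited: everything from its
            # first visit onward is the loop.
            return path[seen[node]:]
        seen[node] = len(path)
        path.append(node)
        node = parent[node]
    return None  # parent missing from the map: incomplete, but not circular


# Toy data (hypothetical taxids): 20 -> 21 -> 22 -> 20 is circular.
toy_parent = {1: 1, 2: 1, 10: 2, 20: 21, 21: 22, 22: 20, 30: 20}
print(find_cycle(10, toy_parent))
print(find_cycle(30, toy_parent))
```

With a real taxonomy dump, the map could be built with `with open(path_to_nodes_dmp) as f: parent = load_parents(f)`, and find_cycle then run on the taxonomy ID of each suspect sequence.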

Thank you for pointing out these issues with the RDP database!

Best,

Rachid

Owen Solberg

Nov 10, 2019, 5:52:13 PM

to CLARK Users

Hi Rachid,

Today I am hitting this exact same behavior after running the suggested setup command: "set_targets.sh DIR_DB bacteria viruses fungi human"

It got stuck processing virus reference sequences. I examined the .virus.fileToTaxIDs file to find the last successfully processed sequence. What exactly do you recommend to overcome this problem? Is it enough to delete the offending fasta file and restart? Do other file(s) need to be edited?

I realize the problem is originating upstream of Clark, but it would be wonderful if Clark could be made somewhat more robust to mangled annotations, which often exist in public databases.

Thanks
owen

Owen Solberg

Nov 29, 2019, 9:54:28 AM

to CLARK Users

Hi again Rachid,

I've found yet another dataset where this same error is coming up.

You say the problem is a circular lineage definition "making the program getfilesToTaxNodes to fall in an infinite loop." It seems like it would be a relatively easy thing to fix: put a check in that part of the code that breaks out if it has looped a large number of times. That way you could issue a warning message to the user and move on.
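
To make the idea concrete, here is a rough sketch of such a check in Python. It is an illustration only, not CLARK's source. It assumes the taxonomy is available as a plain {taxid: parent_taxid} dictionary; the function name lineage and the max_steps default of 1000 are arbitrary choices for the example.

```python
import warnings


def lineage(taxid, parent, max_steps=1000):
    """Walk from taxid toward the root, giving up after max_steps hops.

    Returns the list of taxids visited, or None (with a warning) if the
    walk did not terminate within max_steps, e.g. because of a loop.
    """
    path = [taxid]
    node = taxid
    for _ in range(max_steps):
        nxt = parent.get(node)
        if nxt is None or nxt == node:
            # No parent known, or the node is its own parent (the root):
            # the lineage is complete.
            return path
        path.append(nxt)
        node = nxt
    warnings.warn(
        "lineage of taxid %d exceeded %d steps; skipping it" % (taxid, max_steps)
    )
    return None


# Toy data (hypothetical taxids): 20 -> 21 -> 22 -> 20 never reaches the root.
toy_parent = {1: 1, 2: 1, 10: 2, 20: 21, 21: 22, 22: 20}
print(lineage(10, toy_parent))
print(lineage(20, toy_parent, max_steps=50))
```

A step cap is the bluntest option. Tracking the set of taxids already visited would detect a loop as soon as it closes, at the cost of a little extra memory per sequence.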

Do you make the source code available? I can't seem to find any code repository links from your website.

Thanks

Rachid

Nov 29, 2019, 9:57:46 AM

to CLARK Users

Hi Owen,

Yes, the source code is publicly available. As indicated in the CLARK peer-reviewed manuscript, and the welcome note of this google group, you can find the source and instructions at the "Webpage of CLARK: http://clark.cs.ucr.edu"