Hi Rachid,
I'm trying to create a 16S database from RDP. I downloaded the
GenBank-formatted data and created FASTA files using the following Python script:
from Bio import SeqIO
import sys

# Read GenBank records from standard input and write one FASTA file per record.
input_handle = sys.stdin
for seq_record in SeqIO.parse(input_handle, "genbank"):
    with open(seq_record.id + ".fasta", "w") as fasta:
        fasta.write(">%s\n%s\n" % (seq_record.id, seq_record.seq))
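For reference, the script above writes each sequence on a single line. If wrapped sequence lines are preferred, the record-writing step could look roughly like the sketch below. It uses only the standard library; the helper name format_fasta and the 60-character width are my own assumptions and not something RDP or CLARK requires.

```python
def format_fasta(record_id, sequence, width=60):
    """Return one FASTA record as text (hypothetical helper, not part of Biopython or CLARK).

    The sequence is split into lines of at most `width` characters.
    """
    sequence = str(sequence)
    lines = [">%s" % record_id]
    # Slice the sequence into fixed-width chunks, one chunk per output line.
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example with a made-up ID and sequence.
    print(format_fasta("example_id", "ACGT" * 20), end="")
```

In the loop above, the fasta.write(...) call could then be given format_fasta(seq_record.id, seq_record.seq) instead of the formatted string.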
Then I executed set_targets:
./set_targets.sh /db custom
Loading accession number of all files... done (1421655)
Loading merged Tax ID... done
Retrieving taxonomy ID for each file... done (1421387 files were successfully mapped, and 268 unidentified).
custom: Retrieving taxonomy nodes for each sequence based on taxon ID...
Loading nodes of taxonomy tree... done.
Retrieving lineage for each sequence...
After that there is no further output. I can see that /opt/clark/CLARKSCV1.2.5/exe/getfilesToTaxNodes is using 100% CPU. How long should I wait? Can this step take more than a day?
Best regards,
Jussi Volanen