classify_metagenome.sh unable to find DIR_DB


Marisol Ortiz

Jun 24, 2020, 7:11:28 AM
to CLARK Users
Hi Rachid, 

I am trying to use classify_metagenome.sh after running set_targets.sh DIR_DB bacteria viruses fungi --species, which successfully downloaded the database into my CLARK directory. Ideally I would like to run my paired-end samples in high throughput, so I followed the guide for this. I made the text files, but the program does not recognize DIR_DB and tries to build the database anyway.

At first I suspected the text file I created, which just contained paths to the individual files. As a test of whether the program could recognize the database at all, I moved a single set of paired-end reads into the CLARK directory and ran classify_metagenome.sh on it there, after rerunning set_targets.sh DIR_DB bacteria viruses fungi --species just to make sure. I got the same message: the program will build the database.

I am very unsure what to try next. Please help!

Marisol 

Rachid

Jun 24, 2020, 6:41:57 PM
to CLARK Users
Hi Marisol,

Could you please share the commands you ran and the output on the console, so we can see what is going on on your side?
Thank you,

Best,
Rachid

Marisol Ortiz

Jun 25, 2020, 7:18:59 AM
to CLARK Users
Hi Rachid, 

In a batch job submitted to a computing cluster, I first ran

$ set_targets.sh DIR_DB bacteria viruses fungi --species 

output: a DIR_DB folder in the CLARK directory of my local machine.

I next tried to classify the samples using a loop to go through my subdirectories of samples with a parallel batch job:

#Defined the paths of the directories
WORK_DIR=/projectnb/cnidaria/mdot/
DATABASE=$WORK_DIR/Analysis_Tools/CLARKSCV1.2.6.1
DATA_IN=$WORK_DIR/parsed_samples/  
DATA_OUT_CLASS=$WORK_DIR/clark_outputs/classified
DATA_OUT_ABUN=$WORK_DIR/clark_outputs/abundaces
DIR_DB=$DATABASE/DIR_DB


cd $DATA_IN/Mangrove_Samples

MANGROVE_SAMPLE_DIR_LIST=($(find . -mindepth 1 -maxdepth 1 -type d))  # get all the sample dirs

#Going back to the CLARK directory because it needs to be run where the DIR_DB is
cd $DATABASE

for i in ${!MANGROVE_SAMPLE_DIR_LIST[@]}
do
    #echo "i=$i"
    #Running the job in batches, telling the system how to sort which job into which task
    batch=$(($i%$SGE_TASK_LAST + 1))
    if [ $SGE_TASK_ID -eq $batch ]
    then
        #Defines the sample directories list as a variable
        m_sample_dir=${MANGROVE_SAMPLE_DIR_LIST[$i]}
        #Makes new folders for the output
        mkdir -p $DATA_OUT_CLASS/Mangrove_Samples
        mkdir -p $DATA_OUT_CLASS/Mangrove_Samples/${m_sample_dir}
        #Sample Classification
        echo "start classification for sample $m_sample_dir at $(date)"
        classify_metagenome.sh -P $DATA_IN/Mangrove_Samples/${m_sample_dir}/${m_sample_dir}_M_R1.fq $DATA_IN/Mangrove_Samples/${m_sample_dir}/${m_sample_dir}_M_R2.fq -R ${m_sample_dir}_M
        echo "finished classification for sample $m_sample_dir at $(date)"
        mv ${m_sample_dir}_M.csv $DATA_OUT_CLASS/Mangrove_Samples/${m_sample_dir}
    fi
done

The output of this was:

start classification for sample ./M5GX_raw at Thu Jun 25 06:56:11 EDT 2020

CLARK version 1.2.6.1 (UCR CS&E. Copyright 2013-2019 Rachid Ounit, roun...@cs.ucr.edu)

The program did not find the database files for the provided settings and reference sequences (14895 targets). The program will b$

Starting the creation of the database of targets specific 31-mers from input files...
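
The batching rule in the loop above, batch = i % SGE_TASK_LAST + 1, can be checked in isolation. The following is a minimal sketch, assuming the array job's task IDs run from 1 to N so that SGE_TASK_LAST equals the number of tasks. The function name task_for_index is hypothetical and is not part of CLARK or the scheduler.

```shell
#!/usr/bin/env bash
# Sketch of the round-robin rule batch = i % N + 1.
# ASSUMPTION: N plays the role of SGE_TASK_LAST (task IDs 1..N).
# task_for_index is a hypothetical helper used only for illustration.
task_for_index() {
    local index=$1 task_count=$2
    echo $(( index % task_count + 1 ))
}

# Example: five sample indices (0..4) spread over three tasks.
for i in 0 1 2 3 4; do
    echo "sample index $i -> task $(task_for_index "$i" 3)"
done
```

With three tasks, indices 0 and 3 go to task 1, indices 1 and 4 to task 2, and index 2 to task 3, so every sample is handled by exactly one task.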


I thought maybe it wasn't working because I cd'd out to make the list of samples and then cd'd back into CLARK. I also know that you made a high throughput method of doing this, so I tried to incorporate that as well with the following commands:

#Defined the paths of the directories
WORK_DIR=/projectnb/cnidaria/mdot/
DATABASE=$WORK_DIR/Analysis_Tools/CLARKSCV1.2.6.1
DATA_IN=$WORK_DIR/parsed_samples/  
DATA_OUT_CLASS=$WORK_DIR/clark_outputs/classified
DATA_OUT_ABUN=$WORK_DIR/clark_outputs/abundaces
DIR_DB=$DATABASE/DIR_DB

cd $DATA_IN/Mangrove_Samples

MANGROVE_SAMPLE_DIR_LIST=($(find . -mindepth 1 -maxdepth 1 -type d))  # get all the sample dirs
m_sample_dir=${MANGROVE_SAMPLE_DIR_LIST[$i]}

cd $DATABASE

#Creating textfiles of sample for classify_metagenome.sh, textfiles stored in CLARK directory with DIR_DB
echo $DATA_IN/Mangrove_Samples/${m_sample_dir}/${m_sample_dir}_M_R1.fq >> mangrove_r1.txt
echo $DATA_IN/Mangrove_Samples/${m_sample_dir}/${m_sample_dir}_M_R2.fq >> mangrove_r2.txt
#classifying samples
classify_metagenome.sh -P mangrove_r1.txt mangrove_r2.txt -R mangrove_results

Output: same as above. Can't find the database, tries to rebuild it.
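
Because the script above has no loop around the two echo lines, it appends at most one sample to each list. A minimal sketch of writing one line per sample follows, assuming (as described earlier in the thread) that each list file holds one read path per line and that both lists must be in the same sample order. The helper write_pair_lists is hypothetical, and the _M_R1.fq / _M_R2.fq naming is taken from the script above. This does not by itself address the database message.

```shell
#!/usr/bin/env bash
# write_pair_lists is a hypothetical helper (not part of CLARK).
# It writes one R1 path and one R2 path per sample directory,
# in the same order, to two list files.
write_pair_lists() {
    local samples_root=$1 r1_list=$2 r2_list=$3
    local dir name
    : > "$r1_list"   # truncate so that reruns do not append duplicates
    : > "$r2_list"
    for dir in "$samples_root"/*/; do
        [ -d "$dir" ] || continue      # skip the literal pattern if nothing matches
        name=$(basename "$dir")
        echo "${dir}${name}_M_R1.fq" >> "$r1_list"
        echo "${dir}${name}_M_R2.fq" >> "$r2_list"
    done
}

# Usage (variable names taken from the script above):
#   write_pair_lists "$DATA_IN/Mangrove_Samples" mangrove_r1.txt mangrove_r2.txt
```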

I thought it was really strange, so I tried running a sample directly on the command line, without a scripted batch job, using two test reads I moved to my home folder.
Input:

classify_metagenome.sh -P 35_raw_M_R1.fq 35_raw_M_R2.fq -R test


Output:

The program did not find the database files for the provided settings and reference sequences (14895 targets). The program will build them.

Starting the creation of the database of targets specific 31-mers from input files...


I'm totally at a loss here; I've tried pretty much everything. I even redownloaded CLARK and the database and retried all of the above steps, but I only ever get the above message.
Thanks, 

Marisol

Rachid

Jun 25, 2020, 12:08:56 PM
to CLARK Users
Hi Marisol,

Thank you for sharing this info. It seems you have quite a custom setup, which is likely where the issue comes from.

I did not go through all the technical details of your code, but it does seem like the issue is here:
"
cd $DATA_IN/Mangrove_Samples

MANGROVE_SAMPLE_DIR_LIST=($(find . -mindepth 1 -maxdepth 1 -type d))  # get all the sample dirs

#Going back to the CLARK directory because it needs to be run where the DIR_DB is
cd $DATABASE
"

Why is "cd $DATA_IN/Mangrove_Samples" necessary? And, most importantly, why is "cd $DATABASE" necessary? It is not true that CLARK needs to be run "where the DIR_DB is".
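
For illustration, the sample directories can be listed without any cd by passing the directory to find directly. This is a minimal sketch; the helper list_sample_names is hypothetical. It also prints bare names such as M5GX_raw instead of the ./M5GX_raw form seen in the console output earlier in the thread, which comes from running find on the current directory. It is not a confirmed fix for the database message.

```shell
#!/usr/bin/env bash
# list_sample_names is a hypothetical helper (not part of CLARK).
# It prints the name of each immediate subdirectory of the given
# directory, sorted, without changing the working directory.
list_sample_names() {
    local samples_root=$1
    find "$samples_root" -mindepth 1 -maxdepth 1 -type d -exec basename {} \; | sort
}

# Usage (variable names taken from the script earlier in the thread; needs bash 4+ for mapfile):
#   mapfile -t MANGROVE_SAMPLE_DIR_LIST < <(list_sample_names "$DATA_IN/Mangrove_Samples")
```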

My recommendation is to start by running CLARK on a basic example; many are provided in the README file and on the Overview page of the CLARK webpage.
Have you tried running any of them?

One approach to consider is to start with one of these basic examples and then build up from it to a more refined setup like yours.

Best,
Rachid

Marisol Ortiz

Jun 25, 2020, 12:28:55 PM
to CLARK Users
Hi Rachid, 

As I mentioned at the end of my last message, I tried running this script straight from the command line in the CLARK directory with a set of paired reads copied into the CLARK directory (35_raw_M_R1.fq), and it gave me the same message (see the last output in my previous message).

Given this, are there any other suggestions that would be helpful?

Rachid

Jun 25, 2020, 12:53:14 PM
to CLARK Users
Marisol,
On my side, I followed and executed all the steps indicated in the README and on the Overview page, and had no issue.
Have you looked at my comments on your "cd" usage? It is likely the cause of your problems.
Best,
Rachid

Marisol Ortiz

Jun 25, 2020, 1:10:01 PM
to CLARK Users
Hi Rachid, 

Is there anything that might trigger the "database not found" message besides the database not being present? I can see the DIR_DB folder and all the appropriate sequences, and I am having a hard time understanding why the program cannot recognize it even when I run classify_metagenome.sh on the command line from the CLARK directory. Here is a picture that shows exactly what I did and exactly what happened:

code.jpeg
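
A check like "I can see the folder" can also be written down as a command, so the result can be pasted into the thread as text. The following is a generic sketch only: check_db_dir is a hypothetical helper, and it confirms nothing more than that a directory exists, is accessible, and is not empty. Which files CLARK itself looks for is not stated in this thread, so the sketch does not test for them.

```shell
#!/usr/bin/env bash
# check_db_dir is a hypothetical helper (not part of CLARK).
# It reports whether a directory exists, is accessible, and is non-empty.
check_db_dir() {
    local db_dir=$1
    if [ ! -d "$db_dir" ]; then
        echo "missing: $db_dir"
        return 1
    fi
    if [ ! -r "$db_dir" ] || [ ! -x "$db_dir" ]; then
        echo "not accessible: $db_dir"
        return 1
    fi
    if [ -z "$(ls -A "$db_dir")" ]; then
        echo "empty: $db_dir"
        return 1
    fi
    # Count the entries, stripping the padding some wc versions add.
    echo "ok: $db_dir ($(ls -A "$db_dir" | wc -l | tr -d ' ') entries)"
}

# Usage (variable name taken from the script earlier in the thread):
#   check_db_dir "$DIR_DB"
```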

Would we be able to talk more at length off this forum?