classify_metagenomes unable to find DIR_DB

Marisol Ortiz

Jun 24, 2020, 7:11:28 AM

to CLARK Users

Hi Rachid,

I am trying to use classify_metagenome.sh after running set_targets.sh DIR_DB bacteria viruses fungi --species and successfully downloading the database into my CLARK directory. Ideally I would like to run my paired-end samples in high throughput, so I followed the guide to do this. I made the text files, but the program does not recognize DIR_DB and tries to build the database anyway.

After this I figured it was something with the text file I created, which just had paths to the individual files. As a test to see whether the program could recognize the database at all, I ran classify_metagenome.sh in my CLARK directory on a single set of paired-end reads that I moved there, and I reran set_targets.sh DIR_DB bacteria viruses fungi --species just to make sure. Same error: the program tries to build the database.

Very unsure what the next best thing to do is, please help me!!!!

Marisol

Rachid

Jun 24, 2020, 6:41:57 PM

to CLARK Users

Hi Marisol,

Could you please share the commands you ran and the output on the console, so we can see what is going on on your side?

Thank you,

Best,

Rachid

Marisol Ortiz

Jun 25, 2020, 7:18:59 AM

to CLARK Users

Hi Rachid,

Submitted to a computing cluster via a batch job, I first ran

$ set_targets.sh DIR_DB bacteria viruses fungi --species

output: a DIR_DB folder in the CLARK directory of my local machine.

I next tried to classify the samples using a loop to go through my subdirectories of samples with a parallel batch job:

#Defined the paths of the directories

WORK_DIR=/projectnb/cnidaria/mdot/

DATABASE=$WORK_DIR/Analysis_Tools/CLARKSCV1.2.6.1

DATA_IN=$WORK_DIR/parsed_samples/

DATA_OUT_CLASS=$WORK_DIR/clark_outputs/classified

DATA_OUT_ABUN=$WORK_DIR/clark_outputs/abundaces

DIR_DB=$DATABASE/DIR_DB

cd $DATA_IN/Mangrove_Samples

MANGROVE_SAMPLE_DIR_LIST=($(find . -mindepth 1 -maxdepth 1 -type d)) # get all the sample dirs

#Going back to the CLARK directory because it needs to be run where the DIR_DB is

cd $DATABASE

for i in ${!MANGROVE_SAMPLE_DIR_LIST[@]}

do

#echo "i=$i"

#Running the job in batches, telling the system how to sort which job into which task

batch=$(($i%$SGE_TASK_LAST + 1))

if [ $SGE_TASK_ID -eq $batch ]

then

#Defines the sample directories list as a variable

m_sample_dir=${MANGROVE_SAMPLE_DIR_LIST[$i]}
#Makes new folders for the output

mkdir -p $DATA_OUT_CLASS/Mangrove_Samples

mkdir -p $DATA_OUT_CLASS/Mangrove_Samples/${m_sample_dir}

#Sample Classification

echo "start classification for sample $m_sample_dir at $(date)"

classify_metagenome.sh -P $DATA_IN/Mangrove_Samples/${m_sample_dir}/${m_sample_dir}_M_R1.fq $DATA_IN/Mangrove_Samples/${m_sample_dir}/${m_sample_dir}_M_R2.fq -R ${m_sample_dir}_M

echo "finished classification for sample $m_sample_dir at $(date)"

mv ${m_sample_dir}_M.csv $DATA_OUT_CLASS/Mangrove_Samples/${m_sample_dir}

fi

done
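The batch line in the loop above decides which array task handles which sample. A minimal sketch of that arithmetic follows, assuming the array job numbers its tasks from 1 with a step of 1; the function name task_for_index and the sample names sampleC to sampleE are hypothetical, and n_tasks stands in for $SGE_TASK_LAST.

```shell
#!/usr/bin/env bash
# Map a zero-based sample index to a one-based array task ID.
# n_tasks plays the role of $SGE_TASK_LAST in the script above.
task_for_index() {
    local index=$1 n_tasks=$2
    echo $(( index % n_tasks + 1 ))
}

# Sample names for illustration only; the last three are hypothetical.
samples=(M5GX_raw 35_raw sampleC sampleD sampleE)
n_tasks=3

for i in "${!samples[@]}"; do
    echo "task $(task_for_index "$i" "$n_tasks") handles ${samples[$i]}"
done
```

With three tasks, indices 0 and 3 both land on task 1, so each task works through its share of the samples one after another.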

The output of this was:

start classification for sample ./M5GX_raw at Thu Jun 25 06:56:11 EDT 2020

The program did not find the database files for the provided settings and reference sequences (14895 targets). The program will b$

Starting the creation of the database of targets specific 31-mers from input files...
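The log line above shows the sample as ./M5GX_raw because find . prints each result relative to the starting point. Paths built from that name should still resolve, but the prefix ends up inside file names and log messages. A small sketch of stripping it, using a temporary directory in place of the real sample folder (strip_dot_slash is a hypothetical helper name):

```shell
#!/usr/bin/env bash
# Strip the leading "./" that `find .` adds to each result.
strip_dot_slash() {
    printf '%s\n' "${1#./}"
}

# Temporary stand-in for the Mangrove_Samples directory.
demo_root=$(mktemp -d)
mkdir -p "$demo_root/M5GX_raw" "$demo_root/35_raw"

( cd "$demo_root" && find . -mindepth 1 -maxdepth 1 -type d | sort ) |
while IFS= read -r dir; do
    name=$(strip_dot_slash "$dir")
    echo "$dir -> $name"
done

rm -rf "$demo_root"
```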

I thought maybe it wasn't working because I cd'd out to make the list of samples and then cd'd back into CLARK. I also know that you made a high throughput method of doing this, so I tried to incorporate that as well with the following commands:

#Defined the paths of the directories

WORK_DIR=/projectnb/cnidaria/mdot/

DATABASE=$WORK_DIR/Analysis_Tools/CLARKSCV1.2.6.1

DATA_IN=$WORK_DIR/parsed_samples/

DATA_OUT_CLASS=$WORK_DIR/clark_outputs/classified

DATA_OUT_ABUN=$WORK_DIR/clark_outputs/abundaces

DIR_DB=$DATABASE/DIR_DB

cd $DATA_IN/Mangrove_Samples

MANGROVE_SAMPLE_DIR_LIST=($(find . -mindepth 1 -maxdepth 1 -type d)) # get all the sample dirs

m_sample_dir=${MANGROVE_SAMPLE_DIR_LIST[$i]}

cd $DATABASE

#Creating textfiles of sample for classify_metagenome.sh, textfiles stored in CLARK directory with DIR_DB

echo $DATA_IN/Mangrove_Samples/${m_sample_dir}/${m_sample_dir}_M_R1.fq >> mangrove_r1.txt

echo $DATA_IN/Mangrove_Samples/${m_sample_dir}/${m_sample_dir}_M_R2.fq >> mangrove_r2.txt

#classifying samples

classify_metagenome.sh -P mangrove_r1.txt mangrove_r2.txt -R mangrove_results

Output: same as above. Can't find the database, tries to rebuild it.
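As posted, the snippet above reads $i outside any loop, and it appends with >>, so rerunning it would add duplicate lines to the text files. The following is only a sketch of writing both list files in one pass, assuming the one-path-per-line layout described earlier in this thread; the helper name write_pair_lists and the /path/to prefix are hypothetical, and the CLARK README remains the reference for the exact list format.

```shell
#!/usr/bin/env bash
# Write paired-end list files: line N of each file refers to the same sample.
# Arguments: data directory, R1 list file, R2 list file, then sample names.
write_pair_lists() {
    local data_dir=$1 r1_list=$2 r2_list=$3
    shift 3
    : > "$r1_list"   # truncate so reruns do not append duplicates
    : > "$r2_list"
    local s
    for s in "$@"; do
        printf '%s/%s/%s_M_R1.fq\n' "$data_dir" "$s" "$s" >> "$r1_list"
        printf '%s/%s/%s_M_R2.fq\n' "$data_dir" "$s" "$s" >> "$r2_list"
    done
}

# Demonstration in a temporary directory with a placeholder data path.
out_dir=$(mktemp -d)
write_pair_lists /path/to/Mangrove_Samples \
    "$out_dir/mangrove_r1.txt" "$out_dir/mangrove_r2.txt" \
    M5GX_raw 35_raw

# Show the two lists side by side to confirm the pairs line up.
paste "$out_dir/mangrove_r1.txt" "$out_dir/mangrove_r2.txt"
```

Viewing the two files side by side with paste is a quick way to spot a missing or misordered mate before starting a long classification run.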

I thought it was really strange, so I tried running the sample directly on the command line, without a scripted batch job, using two test read files I moved to my home folder.

Input:

classify_metagenome.sh -P 35_raw_M_R1.fq 35_raw_M_R2.fq -R test

Output:

The program did not find the database files for the provided settings and reference sequences (14895 targets). The program will build them.

Starting the creation of the database of targets specific 31-mers from input files...

I'm totally at a loss here; I've tried pretty much everything. I even redownloaded CLARK, redownloaded the database, and retried all of the above steps, and I only ever get the above message.

Thanks,

Marisol

Rachid

Jun 25, 2020, 12:08:56 PM

to CLARK Users

Hi Marisol,

Thank you for sharing this info. It seems you have quite a custom setup, which is likely where the issue comes from.

I did not dive fully into the technical details of your code, but it does seem like the issue is here:

"

cd $DATA_IN/Mangrove_Samples

MANGROVE_SAMPLE_DIR_LIST=($(find . -mindepth 1 -maxdepth 1 -type d)) # get all the sample dirs

#Going back to the CLARK directory because it needs to be run where the DIR_DB is

cd $DATABASE

"

Why is "cd $DATA_IN/Mangrove_Samples" necessary? And, most importantly, why is "cd $DATABASE" necessary? It is not true that CLARK needs to run "where the DIR_DB is".

My recommendation is to start by executing CLARK with a basic run or example; many are provided in the README file and on the Overview page of the CLARK webpage.

Have you run any of them as a tryout?

One approach for you to consider is to start with one of these basic examples and then build up from it to a more refined setup like yours.
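One way to take the working directory out of the picture is to refer to the database folder by absolute path and confirm it is visible before classifying. The sketch below uses only generic shell checks, not any CLARK option; db_dir_status is a hypothetical helper, and the default path is a placeholder to replace with the directory that was passed to set_targets.sh.

```shell
#!/usr/bin/env bash
# Report whether a database directory exists and contains any entries.
# Prints "missing", "empty", or "populated".
db_dir_status() {
    local dir=$1
    if [ ! -d "$dir" ]; then
        echo "missing"
    elif [ -z "$(ls -A "$dir")" ]; then
        echo "empty"
    else
        echo "populated"
    fi
}

# Placeholder location; substitute the path given to set_targets.sh.
DIR_DB=${DIR_DB:-/path/to/CLARK/DIR_DB}
echo "working directory: $(pwd)"
echo "DIR_DB ($DIR_DB): $(db_dir_status "$DIR_DB")"
```

A populated directory shows only that files are present. The console message quoted earlier in the thread reports 14895 targets, which may mean the reference sequences were located and that the message concerns the derived 31-mer database files rather than the DIR_DB folder itself; the CLARK README is the place to confirm this.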

Best,

Rachid

Marisol Ortiz

Jun 25, 2020, 12:28:55 PM

to CLARK Users

Hi Rachid,

As I mentioned at the end of my last message, I tried running this directly from the command line in the CLARK directory with a set of paired reads copied into the CLARK directory (35_raw_M_R1.fq), and it gave me the same error (see the last part of my message).

Given this, are there any other suggestions that would be helpful?

Rachid

Jun 25, 2020, 12:53:14 PM

to CLARK Users

Marisol,

On my side I followed and executed all the steps indicated in the README or overview page, and had no issue.

Please look at my comments on your "cd" usage; isn't that the likely cause of your problems?

Best,

Rachid

Marisol Ortiz

Jun 25, 2020, 1:10:01 PM

to CLARK Users

Hi Rachid,

Is there anything that might trigger the "database not found" error besides the database not being present? I can see the DIR_DB folder and all the appropriate sequences, and I am having a hard time following why the program cannot recognize it even when I run classify_metagenome.sh on the command line from the CLARK directory. Here is a picture that shows exactly what I did and exactly what happened:

Would we be able to talk more at length off this forum?
