Hi Rachid,
I first ran the following as a batch job on a computing cluster:
$ set_targets.sh DIR_DB bacteria viruses fungi --species
Output: a DIR_DB folder in the CLARK directory of my local machine.
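One detail that may be worth checking: set_targets.sh was given the relative name DIR_DB, while the scripts further down define DIR_DB as an absolute path. The sketch below only shows how a relative name resolves against whatever directory a command is run from. Whether CLARK records the path exactly as typed is an assumption that has not been verified here, and resolve_db, clark_dir and home_dir are hypothetical names used only for illustration.

```shell
#!/usr/bin/env bash
# Sketch: a relative database name points at different places depending on
# the current working directory. All names below are hypothetical.
set -euo pipefail

# resolve_db NAME: print the absolute path that NAME refers to from the cwd
resolve_db() {
    local name=$1
    case $name in
        /*) printf '%s\n' "$name" ;;                  # already absolute
        *)  printf '%s/%s\n' "$(pwd -P)" "$name" ;;   # relative: depends on cwd
    esac
}

demo_root=$(mktemp -d)
mkdir -p "$demo_root/clark_dir" "$demo_root/home_dir"

# Resolve the same relative name from two different directories
from_clark=$(cd "$demo_root/clark_dir" && resolve_db DIR_DB)
from_home=$(cd "$demo_root/home_dir" && resolve_db DIR_DB)

echo "from clark_dir: $from_clark"
echo "from home_dir:  $from_home"

rm -rf "$demo_root"
```

If the two printed paths differ, any tool that receives the relative name sees a different location in each directory, which is one reason to pass an absolute path consistently.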
Next, I tried to classify the samples with a parallel batch job that loops over my sample subdirectories:
#Defined the paths of the directories
WORK_DIR=/projectnb/cnidaria/mdot/
DATABASE=$WORK_DIR/Analysis_Tools/CLARKSCV1.2.6.1
DATA_IN=$WORK_DIR/parsed_samples/
DATA_OUT_CLASS=$WORK_DIR/clark_outputs/classified
DATA_OUT_ABUN=$WORK_DIR/clark_outputs/abundaces
DIR_DB=$DATABASE/DIR_DB
cd $DATA_IN/Mangrove_Samples
MANGROVE_SAMPLE_DIR_LIST=($(find . -mindepth 1 -maxdepth 1 -type d)) # get all the sample dirs
#Going back to the CLARK directory because it needs to be run where the DIR_DB is
cd $DATABASE
for i in ${!MANGROVE_SAMPLE_DIR_LIST[@]}
do
    #echo "i=$i"
    # Run the job in batches: assign each sample index to one array task
    batch=$(($i%$SGE_TASK_LAST + 1))
    if [ $SGE_TASK_ID -eq $batch ]
    then
        # Pick the sample directory for this index
        m_sample_dir=${MANGROVE_SAMPLE_DIR_LIST[$i]}
        # Make new folders for the output
        mkdir -p $DATA_OUT_CLASS/Mangrove_Samples
        mkdir -p $DATA_OUT_CLASS/Mangrove_Samples/${m_sample_dir}
        # Sample classification
        echo "start classification for sample $m_sample_dir at $(date)"
        classify_metagenome.sh -P $DATA_IN/Mangrove_Samples/${m_sample_dir}/${m_sample_dir}_M_R1.fq $DATA_IN/Mangrove_Samples/${m_sample_dir}/${m_sample_dir}_M_R2.fq -R ${m_sample_dir}_M
        echo "finished classification for sample $m_sample_dir at $(date)"
        mv ${m_sample_dir}_M.csv $DATA_OUT_CLASS/Mangrove_Samples/${m_sample_dir}
    fi
done
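To make the batching in the loop easier to follow, here is a minimal standalone sketch of the same round-robin rule (index % task count + 1). The sample names and the task count of 3 are made up, and task_for_index is a hypothetical helper. It assumes the array job starts at task 1 with step 1, so that SGE_TASK_LAST equals the number of tasks.

```shell
#!/usr/bin/env bash
# Sketch of the round-robin assignment used in the loop above.
# Sample names and task count are hypothetical; on the cluster the
# values come from SGE_TASK_ID and SGE_TASK_LAST of the array job.
set -euo pipefail

# task_for_index INDEX TASK_COUNT: print the 1-based task that handles INDEX
task_for_index() {
    local index=$1 task_count=$2
    echo $(( index % task_count + 1 ))
}

samples=(sample_a sample_b sample_c sample_d sample_e)
task_count=3

# Show which task would pick up each sample
for i in "${!samples[@]}"; do
    echo "${samples[$i]} -> task $(task_for_index "$i" "$task_count")"
done
```

With five samples and three tasks, indices 0 and 3 land on task 1, indices 1 and 4 on task 2, and index 2 on task 3, so every sample is handled by exactly one task.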
The output of this was:
start classification for sample ./M5GX_raw at Thu Jun 25 06:56:11 EDT 2020
CLARK version 1.2.6.1 (UCR CS&E. Copyright 2013-2019 Rachid Ounit, roun...@cs.ucr.edu)
The program did not find the database files for the provided settings and reference sequences (14895 targets). The program will b$
Starting the creation of the database of targets specific 31-mers from input files...
I thought maybe it wasn't working because I cd'd out to make the list of samples and then cd'd back into the CLARK directory. I also know that you made a high-throughput way of doing this, so I tried to incorporate that as well with the following commands:
#Defined the paths of the directories
WORK_DIR=/projectnb/cnidaria/mdot/
DATABASE=$WORK_DIR/Analysis_Tools/CLARKSCV1.2.6.1
DATA_IN=$WORK_DIR/parsed_samples/
DATA_OUT_CLASS=$WORK_DIR/clark_outputs/classified
DATA_OUT_ABUN=$WORK_DIR/clark_outputs/abundaces
DIR_DB=$DATABASE/DIR_DB
cd $DATA_IN/Mangrove_Samples
MANGROVE_SAMPLE_DIR_LIST=($(find . -mindepth 1 -maxdepth 1 -type d)) # get all the sample dirs
m_sample_dir=${MANGROVE_SAMPLE_DIR_LIST[$i]}
cd $DATABASE
#Creating text files that list the sample reads for classify_metagenome.sh; the text files are stored in the CLARK directory with DIR_DB
echo $DATA_IN/Mangrove_Samples/${m_sample_dir}/${m_sample_dir}_M_R1.fq >> mangrove_r1.txt
echo $DATA_IN/Mangrove_Samples/${m_sample_dir}/${m_sample_dir}_M_R2.fq >> mangrove_r2.txt
#classifying samples
classify_metagenome.sh -P mangrove_r1.txt mangrove_r2.txt -R mangrove_results
Output: same as above. Can't find the database, tries to rebuild it.
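In the commands above, $i is used outside of any loop, so presumably at most one sample ends up in the two list files. A minimal sketch of filling the lists for every sample directory follows. It runs in a throwaway directory with made-up sample names (sample_a, sample_b), does not call CLARK, and only reuses the file-name pattern from the commands above.

```shell
#!/usr/bin/env bash
# Sketch: write one R1 path and one R2 path per sample into two list files.
# Uses a temporary directory and hypothetical sample names; CLARK is not run.
set -euo pipefail

demo_root=$(mktemp -d)
samples_dir=$demo_root/Mangrove_Samples
mkdir -p "$samples_dir/sample_a" "$samples_dir/sample_b"

r1_list=$demo_root/mangrove_r1.txt
r2_list=$demo_root/mangrove_r2.txt
: > "$r1_list"   # truncate first so reruns do not append duplicates
: > "$r2_list"

# One line per sample directory, in the same order in both files
for dir in "$samples_dir"/*/; do
    name=$(basename "$dir")
    echo "$samples_dir/$name/${name}_M_R1.fq" >> "$r1_list"
    echo "$samples_dir/$name/${name}_M_R2.fq" >> "$r2_list"
done

r1_count=$(wc -l < "$r1_list")
r2_count=$(wc -l < "$r2_list")
first_r1=$(head -n 1 "$r1_list")
first_r2=$(head -n 1 "$r2_list")

echo "R1 entries: $((r1_count)), R2 entries: $((r2_count))"
cat "$r1_list"

rm -rf "$demo_root"
```

Truncating the list files before the loop matters because >> appends, so rerunning the job would otherwise list the same reads more than once.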
That seemed really strange, so I skipped the batch script and ran the classification directly on the command line, using two test read files that I had moved to my home folder.
Input:
classify_metagenome.sh -P 35_raw_M_R1.fq 35_raw_M_R2.fq -R test
Output:
The program did not find the database files for the provided settings and reference sequences (14895 targets). The program will build them.
Starting the creation of the database of targets specific 31-mers from input files...
I'm totally at a loss here; I've tried pretty much everything. I even redownloaded CLARK, redownloaded the database, and retried all of the steps above, but I only ever get the message above.
Thanks,
Marisol