--
Ludovic Duvaux, Postdoctoral Research Associate
Department of Animal & Plant Sciences
Western Bank
University of Sheffield
Sheffield, S10 2TN
United Kingdom
Tel: +44 (0) 1142220112
My parallelization on Iceberg works in 2 steps:
1) Submitting an array of identical tasks using SGE (command lines enclosed in a bash script)
2) Extracting information from a text file to give specific orders to each task.
1) Submitting an array of identical tasks using SGE (i.e. 1 job -> n tasks)
Set up the bash script using the options:
- '#$ -l h_rt=7:59:59' (time required; note that it is less than 8 hours, so that the job is processed on the short queue, for which our jobs have almost no waiting time).
- '#$ -t 1-120' (submits an array of identical tasks that differ only by an index number and that Grid Engine treats almost like a series of jobs).
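The two options above can be sketched as the header of a minimal array-job script. This is only an illustration, not the attached script: the fallback value for the task id is my addition, so that the sketch can also be run outside Grid Engine.

```shell
#!/bin/bash
# Request just under 8 hours of run time (the short-queue limit described above).
#$ -l h_rt=7:59:59
# Submit an array of 120 tasks, indexed 1 to 120.
#$ -t 1-120

# Grid Engine sets SGE_TASK_ID for each task of the array.
# The fallback value 1 is only there so the sketch also runs outside SGE.
taskid=${SGE_TASK_ID:-1}
echo "Running task ${taskid}"
```

Submitted with 'qsub', such a script would be started once per task, each time with a different value of SGE_TASK_ID.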
As an example, see the attached script "Picard.CalculateHSmetrics.sh" (I advise you to open this file with a text editor that recognizes the syntax, such as Geany). The first part of this script (up to '#$ -t 1-120') is the SGE setup for Iceberg, and the second part is what the script actually does.
My script does three things:
a) setting up some important variables, like the job id and the task id, that will be useful later (e.g. 'jobid=${JOB_ID}' ; 'taskid=${SGE_TASK_ID}' ; 'host=${HOSTNAME}')
b) writing times to the log files in order to have an idea of the time needed to process the job
c) calling an R script called “01_Picard.CalculateHSmetrics.R”.
Note that in the R command line, the argument '--args' allows you to pass the bash variables to R. Note also that you can use bash variables to call a given script or to give specific names to your log files. For instance, the R command line calls the script “01_Picard.CalculateHSmetrics.R” through the 'fil' variable used in '$fil.R', and the job id and task id are used in the log file name (thus if my job number is 1532 and the task id is 74, my log file will be called “/data/bo1ld/06_NERC_capture/03_RemoveDuplicatedLoci/00_PropSeq.HSmetrics_Rlog/01_Picard.CalculateHSmetrics.1532.74.log”).
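The three parts a), b) and c) could look roughly like the sketch below. It is not the attached script: the temporary log directory, the fallback values and the exact form of the R command line ('R --vanilla --args ... < script') are my assumptions, and the R call is skipped if R or the script is not found.

```shell
#!/bin/bash
# a) Variables set by Grid Engine; the fallbacks only let the sketch run outside SGE.
jobid=${JOB_ID:-0}
taskid=${SGE_TASK_ID:-1}
host=${HOSTNAME:-unknown}

# Base name of the R script, reused for the log file name.
fil="01_Picard.CalculateHSmetrics"

# Hypothetical log directory (a temporary one here, to keep the sketch self-contained).
logdir=$(mktemp -d)
logfile="${logdir}/${fil}.${jobid}.${taskid}.log"

# b) Record the start time.
echo "start: $(date) on ${host}" > "$logfile"

# c) Call the R script, passing the bash variables after '--args'.
if command -v R >/dev/null 2>&1 && [ -f "${fil}.R" ]; then
    R --vanilla --args "$jobid" "$taskid" < "${fil}.R" >> "$logfile" 2>&1
else
    echo "skipped: R or ${fil}.R not found" >> "$logfile"
fi

# b) Record the end time.
echo "end: $(date)" >> "$logfile"
echo "log written to ${logfile}"
```

Because the job id and the task id are part of the log file name, each task of the array writes to its own log file.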
2) Extracting information from a text file to give specific orders to the different tasks of your job
As shown above, I personally use an R script to do so, but one can also use a Perl, Python or even Bash script, depending on which language one prefers (and the possibilities offered by that language).
The idea here is that the R script looks up the specific commands corresponding to the task id in a text file previously prepared by the user (see “00_ProcessStampyOutput_details.1.txt”).
The R call 'argt<-commandArgs(TRUE)' allows you to retrieve, as R variables, the bash variables previously passed with the argument '--args'.
Finally, using all this information, my R script is able, using the function 'system', to run my program of interest with different options/files for the different tasks.
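For those who prefer Bash to R for this step, the lookup can be sketched as follows. The format of the details file (one line per task, a file prefix followed by an option) and the placeholder names in it are hypothetical, and the sketch only prints what it would run instead of calling a real program.

```shell
#!/bin/bash
# Hypothetical details file: one line per task, "<file prefix> <option>".
details=$(mktemp)
printf '%s\n' "sample_A optionA" "sample_B optionB" "sample_C optionC" > "$details"

# Task id from Grid Engine; the fallback 2 only lets the sketch run outside SGE.
taskid=${SGE_TASK_ID:-2}

# Pick the fields of line number $taskid.
prefix=$(awk -v n="$taskid" 'NR == n { print $1 }' "$details")
option=$(awk -v n="$taskid" 'NR == n { print $2 }' "$details")

# A real script would run the program of interest here; this one only prints.
echo "task ${taskid}: would process files starting with ${prefix} using ${option}"
rm -f "$details"
```

With this layout, line n of the details file holds the orders for task n, so the '-t' range of the job should match the number of lines in the file.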
The explanation is a bit long, so let's have a concrete example:
1) In the Iceberg shell I run 'qsub Picard.CalculateHSmetrics.sh'.
2) This script will run the R script “01_Picard.CalculateHSmetrics.R” 120 times, each time with a different task id.
3) Using the function 'system', the R script calls the Picard tools “CalculateHsMetrics.jar” package with specific options as defined by the task id and the “00_ProcessStampyOutput_details.1.txt” file (for example, for task id 25, the R script will process the file whose name begins with “Lathyrus_282”).
Finally, believe it or not, this is a simple example where I do the same thing 120 times but with different files. The good thing about this approach is that you can configure your scripts to do much more complicated parallelization, allowing different task ids to have different options, files or parts of a file to be processed (as with the '--processpart=' option of Stampy)...
As an example, see the files “01_Stampy_FinalRun_details.txt” and “02_Stampy_Final_20120420.R”, with which I processed 4098 different analyses by submitting only 1 job (all the tasks of this job being processed on the short queue).
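One possible way to give each task a different file and a different part of that file is to derive both from the task id with integer arithmetic. The sketch below is not taken from my Stampy scripts: the number of parts per file is hypothetical, and the "part/total" value format for '--processpart=' is an assumption to check against the Stampy documentation.

```shell
#!/bin/bash
# Hypothetical layout: every input file is split into the same number of parts.
parts_per_file=3

# Task id from Grid Engine; the fallback 7 only lets the sketch run outside SGE.
taskid=${SGE_TASK_ID:-7}

# Task ids are 1-based: tasks 1..3 -> file 1, tasks 4..6 -> file 2, and so on.
file_index=$(( (taskid - 1) / parts_per_file + 1 ))
part_index=$(( (taskid - 1) % parts_per_file + 1 ))

# Assumed value format "part/total" for Stampy's '--processpart=' option.
opt="--processpart=${part_index}/${parts_per_file}"
echo "task ${taskid}: file ${file_index}, option ${opt}"
```

With such a mapping, the upper bound given to '-t' would be the number of files multiplied by the number of parts per file.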
If you need some help, you can email me.