think parallel !

claudiuskerth

Nov 8, 2012, 11:31:17 AM

to NGS...@googlegroups.com

Hi guys,

If you need to run the same programme on many files and want to use the multi-core CPU power of your machine to speed things up, I can highly recommend GNU parallel: http://www.gnu.org/software/parallel/. If the programme you are using is not multithreaded internally, like the very popular mapping programme Stampy, for instance, just run it like this:

$ parallel './stampy.py -g hg18 -h hg18 -M {} -o {.}.sam' ::: sample_*.fastq

... which will map the reads from as many samples in parallel as there are processors available. If you share a machine, you can limit the number of processors used. Parallel is extremely flexible and its documentation is vast. I recommend having a look at the many examples provided in the documentation ($ man parallel).
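As a minimal sketch of limiting the number of processors: parallel's -j option sets the maximum number of jobs run at once. The function names and the default of 4 jobs below are only illustrative choices, not part of parallel or Stampy.

```shell
#!/bin/bash
# Output name that parallel's {.} placeholder yields for one input file:
# {} is the input as given, {.} is the input with its last extension removed.
sam_name() {
    printf '%s.sam\n' "${1%.*}"
}

# Map all samples, running at most $1 jobs at once
# (default 4, an arbitrary choice for a shared machine).
map_samples() {
    local max_jobs=${1:-4}
    parallel -j "$max_jobs" \
        './stampy.py -g hg18 -h hg18 -M {} -o {.}.sam' ::: sample_*.fastq
}
```

Calling 'map_samples 2' would use at most two cores. Adding --dry-run to the parallel call prints the commands without running them, which is a cheap way to check the file names first.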

claudius

Ludovic Duvaux

Nov 8, 2012, 1:01:51 PM

to NGS...@googlegroups.com, claudiuskerth

Hi guys,

Although the tip from Claudius looks really interesting, the quantity of data to process is sometimes too large for our poor personal computers.
In this case, I advise you to run your jobs on Iceberg (that is obvious, you will tell me), and above all to use only the short queue (the one limited to 8 hours), on which we have, most of the time, immediate access to up to 75 cores.

Although many jobs may need more than 8 hours to complete, check the program's manual carefully, as a lot of programs have options for splitting a job into several parts (this is notably true for mapping programs, e.g. for Stampy it is the "--processpart=" option).

Ludovic

-- 
******************************************************************
Ludovic Duvaux, Postdoctoral Research Associate

Department of Animal & Plant Sciences
Western Bank
University of Sheffield
Sheffield, S10 2TN
United Kingdom 

Tel: +44 (0) 1142220112

******************************************************************

claudiuskerth

Nov 8, 2012, 3:02:23 PM

to NGS...@googlegroups.com

Just to make things clear for everybody, Ludovic:

Could you maybe post one of your Iceberg job submission bash scripts that accomplishes what you were suggesting? Some helpful comments on it would be nice too :-)

many thanks

claudius

Ludovic Duvaux

Nov 9, 2012, 2:38:39 PM

to NGS...@googlegroups.com, claudiuskerth

Hi,

Well, this may be a bit complicated for people not familiar with scripting, but let's try to explain.

My parallelization on Iceberg works in 2 steps:
1) Submitting an array of identical tasks using SGE (command lines enclosed in a bash script)
2) Extracting information from a text file to give specific orders to each task.

1) Submitting an array of identical tasks using SGE (i.e. 1 job -> n tasks)
Set up the bash script using the options:
- '#$ -l h_rt=7:59:59' (the run time requested; note that it is less than 8 hours, so that the job is processed on the short queue, where there is almost no waiting time).
- '#$ -t 1-120' (submits an array of identical tasks that differ only by an index number and that Grid Engine treats almost like a series of separate jobs).

As an example, see the attached script "Picard.CalculateHSmetrics.sh" (I advise you to open it with a text editor that recognizes the syntax, such as Geany). The first part of this script (up to '#$ -t 1-120') is the SGE setup for Iceberg, and the second part is what the script actually does.

My script does three things:
a) setting up some important variables, such as the task id, that will be useful later (e.g. 'jobid=${JOB_ID}' ; 'taskid=${SGE_TASK_ID}' ; 'host=${HOSTNAME}')
b) writing times to the log files in order to get an idea of the time needed to process the job
c) calling an R script called “01_Picard.CalculateHSmetrics.R”.
Note that in the R command line, the argument '--args' allows you to pass the bash variables to R. Note also that you can use bash variables to call a given script or to give specific names to your log files. For instance, the R command line calls the script “01_Picard.CalculateHSmetrics.R” through the 'fil' variable used in '$fil.R', and the same variables name the log file: if my job number is 1532 and the task id is 74, my log file will be called “/data/bo1ld/06_NERC_capture/03_RemoveDuplicatedLoci/00_PropSeq.HSmetrics_Rlog/01_Picard.CalculateHSmetrics.1532.74.log”.
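For readers without the attachment at hand, here is a minimal sketch of what such a job script could look like. It is not the attached script: the 'logdir' variable, the fallback values after ':-' and the exact form of the R call are assumptions made so the sketch can also be tried outside the scheduler.

```shell
#!/bin/bash
# Run time under 8 hours, so each task is eligible for the short queue.
#$ -l h_rt=7:59:59
# Array job: 120 tasks, each started with its own SGE_TASK_ID (1..120).
#$ -t 1-120

# Set by Grid Engine inside a running task; the fallbacks after ':-' are
# placeholders (assumptions) for trying the script outside the scheduler.
jobid=${JOB_ID:-0}
taskid=${SGE_TASK_ID:-1}
host=${HOSTNAME:-unknown}

fil=01_Picard.CalculateHSmetrics   # R script name without the .R extension
logdir=${LOGDIR:-.}                # hypothetical variable: directory for log files
log="$logdir/$fil.$jobid.$taskid.log"

# Record start and end times so the run time of each task can be read from its log.
echo "start: $(date) on $host" >> "$log"
# Values after --args are what commandArgs(TRUE) returns inside R.
if [ -f "$fil.R" ]; then
    R --vanilla --args "$jobid" "$taskid" < "$fil.R" >> "$log" 2>&1
fi
echo "end: $(date)" >> "$log"
```

With job number 1532 and task id 74, this writes to a log named 01_Picard.CalculateHSmetrics.1532.74.log, matching the naming scheme described above.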

2) Extracting information from a text file to give specific orders to the different tasks of your job
As shown above, I personally use an R script to do so, but you can also use a Perl, Python or even a Bash script, depending on which language you prefer (and the possibilities offered by that language).
The idea here is that the R script looks up the specific commands corresponding to the task id in a text file previously prepared by the user (see “00_ProcessStampyOutput_details.1.txt”).
The R call 'argt<-commandArgs(TRUE)' retrieves, as R variables, the bash variables that were passed after the argument '--args'.
Finally, using all this information, my R script is able, using the function 'system', to run my program of interest with different options/files for the different tasks.
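The same lookup can be sketched in Bash instead of R. The file layout assumed here (whitespace-separated columns, task id in column 1, file prefix in column 2) and the function name 'task_field' are hypothetical; the attached details files may be organised differently.

```shell
#!/bin/bash
# Print one field of the first line whose first column equals the task id.
# Assumed layout (hypothetical; the attached details files may differ):
# whitespace-separated columns, column 1 = task id, column 2 = file prefix.
task_field() {
    local details=$1   # path to the details file
    local taskid=$2    # e.g. "$SGE_TASK_ID"
    local column=$3    # number of the column to return
    awk -v id="$taskid" -v col="$column" \
        '$1 == id { print $col; exit }' "$details"
}
```

Inside a task, something like 'prefix=$(task_field details.txt "$SGE_TASK_ID" 2)' would then give the file prefix to pass on to the program of interest. An empty result means no line matched the task id, which is worth checking before running anything.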

The explanation is a bit long, so let's look at a concrete example:
1) In the Iceberg shell, I run 'qsub Picard.CalculateHSmetrics.sh'.
2) This script runs the R script “01_Picard.CalculateHSmetrics.R” 120 times, each time with a different task id.
3) Using the function 'system', the R script calls the Picard tool “CalculateHsMetrics.jar” with specific options as defined by the task id and the “00_ProcessStampyOutput_details.1.txt” file (for example, for task id 25, the R script processes the file whose name begins with “Lathyrus_282”).

Finally, believe it or not, this is a simple example where I do the same thing 120 times but with different files. The good thing with this approach is that you can configure your scripts to do much more complicated parallelization, allowing different task ids to process different options, files or parts of a file (like with the '--processpart=' option of Stampy)...
As an example, see the files “01_Stampy_FinalRun_details.txt” and “02_Stampy_Final_20120420.R”, with which I processed 4098 different analyses by submitting only 1 job (all the tasks of this job being processed on the short queue).

If you need some help, you can email me.

Ludovic

Picard.CalculateHSmetrics.sh

01_Picard.CalculateHSmetrics.R

00_ProcessStampyOutput_details.1.txt

01_Stampy_FinalRun_details.txt

02_Stampy_Final_20120420.R
