MAQ via SGE or other Grid Engine

Victor Ruotti

Jan 28, 2009, 19:18:23

to Grid Engine Life Science SIG

Hi,
Does anyone have a way to plug MAQ or GMAP into SGE?
I noticed the 0.7.1 version comes with a Perl script called farm-run.pl. It sounds like it is meant more for LSF than for SGE.
Someone must be running MAQ on a farm.

Please let me know if you do, as we are trying to get some MAQ results and will need to move this to a farm at some point.
Thanks in advance.
Victor Ruotti

Sean Davis

Jan 28, 2009, 19:48:56

to Victor Ruotti, Grid Engine Life Science SIG

This is untested, but you can just write a little script that takes parameters from the command line. If your standard command-line invocation looks like:

MAQ command <filename1> <filename2> ...

Then, just make your submit script look something like:

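#!/bin/sh
# Pass every argument through unchanged; the quotes keep arguments with spaces intact.
maq "$@"

The "$@" takes all the command-line arguments and passes them to maq (the MAQ executable is named maq, in lowercase). If you call the script MAQ_submit.sh, then you would do something like: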

qsub MAQ_submit.sh command <filename1> <filename2> ...

That should do it. You can also use qrsh to submit the MAQ command directly. GMAP will work similarly. As an aside, you might look at bowtie, as well. It is quite a bit faster than either MAQ or GMAP.

Hope that helps.

Sean
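
A slightly fuller version of that wrapper might look like the sketch below. It is only an illustration: the embedded "#$" options (-S, -cwd, -j y) are standard qsub options, but whether you want them depends on your site, and the file names in the commented qsub line are made up.

```shell
# Write the wrapper to disk. Lines starting with "#$" are read by qsub
# as embedded submit options.
cat > MAQ_submit.sh <<'EOF'
#!/bin/sh
#$ -S /bin/sh
#$ -cwd
#$ -j y
# -S sets the job shell, -cwd runs the job in the submit directory,
# -j y merges stderr into stdout.
maq "$@"
EOF
chmod +x MAQ_submit.sh

# Submission needs a working SGE install, so it is left commented out.
# out.map, ref.bfa and reads.bfq are hypothetical file names.
# qsub MAQ_submit.sh map out.map ref.bfa reads.bfq
```

Without -cwd the job would start in your home directory, so relative file names on the qsub line would probably not resolve the way you expect.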

Victor Ruotti

Jan 28, 2009, 20:15:31

to Grid Engine Life Science SIG

Thanks for the quick response.

Aha, that was exactly what I was looking for. Makes total sense.

Will be testing that tonight to see how it works.

I assume SGE will split at the command level, so one maq command will go to one node?

It is taking me over a day to run just one lane of GAII RNA-seq on a 4x4 Xeon machine with 128GB of RAM.

Our SGE nodes are not that powerful, so it will take longer for one node to get that done as one command with the same file.

As an alternative, maybe I should split the initial lane into multiple files and then do what you suggested.

That way each node only does a bit of the work. The question is: is it straightforward to put the results together at the end if you split the initial fasta file into smaller pieces? Or is that not recommended? We are only doing mapping, no assembly, so I am hoping this is possible.

Victor

Sean Davis

Jan 28, 2009, 21:08:18

to Victor Ruotti, Grid Engine Life Science SIG

That is a large machine for doing this type of stuff. We have found that a good amount of RAM is about 4GB per core for most of the sequence-related stuff.

Each MAQ process runs on one processor. If you have multiple processors per node, you may end up with multiple MAQ processes on a given machine, which is not a problem. You can, of course, parallelize further within a lane by splitting the sequence files. MAQ mapmerge is made to take multiple .map files and merge them, so combining the results is trivial.

As for time, with about 50 cores (the only serious timing we have done), we can align about 1 billion 36mers in about an hour using Bowtie. You might look at it for increased speed. If you need MAQ's downstream processing for SNP discovery, there are tools for converting from bowtie to MAQ format.

Sean
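
To make the merge step concrete, here is a small sketch that only prints the mapmerge command for a lane that was split into numbered chunks. The naming scheme (lane1.1.map, lane1.2.map, ...) and the helper name mapmerge_cmd are assumptions for illustration; the argument order (output first, then the inputs) follows maq mapmerge's usage.

```shell
# mapmerge_cmd LANE N: print the command that would merge LANE.1.map .. LANE.N.map
# into LANE.map. Nothing is executed; the file naming is hypothetical.
mapmerge_cmd() {
    lane=$1
    n=$2
    cmd="maq mapmerge $lane.map"
    i=1
    while [ "$i" -le "$n" ]; do
        cmd="$cmd $lane.$i.map"
        i=$((i + 1))
    done
    echo "$cmd"
}

mapmerge_cmd lane1 3
# prints: maq mapmerge lane1.map lane1.1.map lane1.2.map lane1.3.map
```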

Quang Trinh

Jan 28, 2009, 21:12:35

to Victor Ruotti, Grid Engine Life Science SIG

Hi Victor,
We split our reads into files of 1,000,000 reads each and then align the individual split files on different nodes of our cluster. Once all the alignments are done, we do mapmerge and then assembly, and so on. You can put all of this in one wrapper ... Hope this helps.

Q
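
One way such a wrapper could be organized on SGE is as an array job (one task per split file) followed by a merge job that waits for it. The sketch below is not the wrapper described above, just a guess at the shape: map_task.sh, merge_task.sh, plan_lane and the lane1.N.bfq naming are all hypothetical, while -t, -N, -hold_jid and SGE_TASK_ID are standard SGE features.

```shell
# Per-chunk task script. SGE sets SGE_TASK_ID for each task of an array job.
cat > map_task.sh <<'EOF'
#!/bin/sh
#$ -S /bin/sh
#$ -cwd
ref=$1      # reference in .bfa format
lane=$2     # prefix shared by the split read files (hypothetical naming)
# Stop with an error message if SGE_TASK_ID is unset or empty.
task=${SGE_TASK_ID:?run as an array job, e.g. qsub -t 1-8}
maq map "$lane.$task.map" "$ref" "$lane.$task.bfq"
EOF
chmod +x map_task.sh

# Print, but do not run, the two submissions for a lane split into n chunks.
# merge_task.sh is a hypothetical script that would call maq mapmerge.
plan_lane() {
    lane=$1
    n=$2
    ref=$3
    echo "qsub -N map_$lane -t 1-$n map_task.sh $ref $lane"
    echo "qsub -N merge_$lane -hold_jid map_$lane merge_task.sh $lane $n"
}

plan_lane lane1 8 ref.bfa
```

The -hold_jid option keeps the merge job queued until every task of the mapping job has finished, so the whole lane can be submitted in one go.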

Victor Ruotti

Jan 30, 2009, 18:47:44

to Grid Engine Life Science SIG

Can you share the program you use to split the reads?

I'm in the process of writing a fasta/fastq/seq/prb splitter. I use Perl and thought about starting a bioperl module for next-gen stuff.

They already have a bunch of modules to deal with qualities, so I was thinking that adding a method to split fastq files for maq/gmap processing would be something good to have in bioperl. Great work has also been done with biostrings, and maybe Martin can comment on this.

As simple as it might sound, it would be good to have bioperl and maybe biostrings set up for this.

Will try to post this in the biostrings forum as well.

Any thoughts/interest on this?

Victor

Quang Trinh

Feb 1, 2009, 8:24:47

to Victor Ruotti, Grid Engine Life Science SIG

Hi Victor,
We use "maq fastq2bfq -n 1000000 ..." to split the reads.

I wrote a bit of Perl code a while back for some post-processing analysis but soon found out that Perl was buffering data when writing to disk, to the point that it took down a node on our cluster. Not sure if anyone else has had this problem.

The splitter module would be useful. I am not very good at coding, so count me out on this project. :-) I would be happy to be the beta tester though. :-)

Q
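
For plain FASTQ input, where maq fastq2bfq -n is not an option (for example when feeding GMAP), a splitter can be sketched in a few lines of shell and awk. This is an illustration only: the function name split_fastq and the output naming are made up, and it assumes strict four-line FASTQ records, so files with wrapped sequence or quality lines would be split incorrectly.

```shell
# split_fastq IN N PREFIX
# Write PREFIX.1.fastq, PREFIX.2.fastq, ... with at most N reads each.
# Assumes every record is exactly 4 lines (no wrapped sequence/quality lines).
split_fastq() {
    awk -v n="$2" -v p="$3" '
        # At the first line of every block of n records, switch output file.
        (NR - 1) % (4 * n) == 0 {
            if (out != "") close(out)
            out = sprintf("%s.%d.fastq", p, ++chunk)
        }
        { print > out }
    ' "$1"
}

# Example call (lane1.fastq is a hypothetical input file):
# split_fastq lane1.fastq 1000000 lane1
```

Each output file is closed before the next one is opened, so only one output handle is open at a time.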
