[Genome] Running BLATs in parallel


Assaf Gordon

Jul 5, 2010, 3:27:41 PM
to gen...@soe.ucsc.edu
Hello,

A while ago there was a mention of running BLATs in parallel on a supercomputer ( https://lists.soe.ucsc.edu/pipermail/genome/2010-June/022692.html ).

Could you describe the method you use to run BLAT in parallel?
Are you running multiple BLAT instances (on one node with multiple cores, or on multiple nodes),
or is it some gfServer/gfClient configuration?

Specifically,
I'm wondering about memory usage:
If I run multiple BLAT processes on a single machine (with the same parameters), the 2bit database will be loaded once per process and consume a lot of memory.

Any hint or advice will be appreciated,

thanks,
-gordon


Hiram Clawson

Jul 5, 2010, 4:45:02 PM
to Assaf Gordon, gen...@soe.ucsc.edu
To run multiple blat processes in parallel you need to break your target
genome up into smaller pieces. It would be very inefficient to run each
of your queries against the entire genome 2bit file. Split the target at
least into single chromosomes; smaller pieces are even better. There are
several ways to partition your target; faSplit is the simplest.
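
A minimal sketch of the partitioning described above, assuming the target is
available as FASTA. It is a plain-Python stand-in, not faSplit itself; the
function names and the `name:start-end` piece-naming scheme are hypothetical.

```python
def split_fasta_by_record(lines):
    """Group FASTA lines into (name, sequence) pairs, one per record."""
    records = []
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                records.append((name, "".join(chunks)))
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def chunk_sequence(name, seq, chunk_size):
    """Cut one sequence into fixed-size pieces, keeping each piece's offset in its name."""
    pieces = []
    for start in range(0, len(seq), chunk_size):
        end = min(start + chunk_size, len(seq))
        pieces.append((f"{name}:{start}-{end}", seq[start:end]))
    return pieces
```

Keeping the start offset in each piece name makes it possible to translate
alignment coordinates back to whole-chromosome coordinates afterwards.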

Same goes for your query sequences if they happen to be very large.
Typically query sequences are already pretty small since they are
usually things such as cDNAs.

If you are trying to run genome-to-genome alignments, it is better
to use lastz instead of blat.
http://www.bx.psu.edu/miller_lab/dist/README.lastz-1.02.00/README.lastz-1.02.00a.html

If you need controlling software for your supercomputer, parasol is
a simple system to manage multiple nodes and individual CPUs.
http://users.soe.ucsc.edu/~donnak/eng/parasol.htm

--Hiram
_______________________________________________
Genome maillist - Gen...@lists.soe.ucsc.edu
https://lists.soe.ucsc.edu/mailman/listinfo/genome

Galt Barber

Jul 5, 2010, 5:04:42 PM
to Assaf Gordon, gen...@soe.ucsc.edu
I could start by asserting
that if you have a REAL supercomputer,
you are not worried about RAM ;)

If it does become an issue, you could try
using a smaller number of gfServer instances
mixed with a somewhat larger number of gfClient
instances. The gfClient still needs to use
RAM for the alignments that it is working on,
and does a significant amount of work.
We don't have anything that does this
automatically.
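
A rough sketch of what such a layout could look like: one gfServer per node
holding the index, and several gfClient processes pointed at it. The helper
names, host, port, and file names are hypothetical, and the argument order
(`gfServer start host port file(s)` and `gfClient host port seqDir in.fa out.psl`)
should be checked against the usage message of your installed version.

```python
import os


def gfserver_command(host, port, twobit_files):
    """Argument list that starts one gfServer serving the given 2bit file(s)."""
    return ["gfServer", "start", host, str(port), *twobit_files]


def gfclient_command(host, port, seq_dir, query_fa, out_psl):
    """Argument list for one gfClient run against an already running gfServer."""
    return ["gfClient", host, str(port), seq_dir, query_fa, out_psl]


def client_commands(host, port, seq_dir, query_files):
    """One gfClient command per query file; output is named after the query."""
    commands = []
    for query in query_files:
        out_psl = os.path.splitext(query)[0] + ".psl"
        commands.append(gfclient_command(host, port, seq_dir, query, out_psl))
    return commands
```

The index is then held in memory once per server rather than once per job,
while each client still needs its own RAM for the alignments it is working on.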

As Hiram mentioned, depending on what
you are doing you might want to split
up the query and target as another way
to both parallelize your job and
possibly reduce resource requirements.
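
As an illustration of splitting both sides, the sketch below pairs every
target piece with every query piece and emits one `blat database query output.psl`
command per pair. The function name, file names, and output naming scheme are
hypothetical.

```python
import posixpath
from itertools import product


def blat_job_list(target_pieces, query_pieces, out_dir):
    """Return one blat command line per (target piece, query piece) pair."""
    jobs = []
    for target, query in product(target_pieces, query_pieces):
        # Name each output after the two inputs so results never collide.
        t_name = posixpath.splitext(posixpath.basename(target))[0]
        q_name = posixpath.splitext(posixpath.basename(query))[0]
        out_psl = f"{out_dir}/{t_name}.{q_name}.psl"
        jobs.append(f"blat {target} {query} {out_psl}")
    return jobs
```

Each line is an independent job, so the list can be handed to whatever batch
system is available, one command per job.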

Around here we are mostly running on
Beowulf-style commodity clusters
with a few hundred machines.
They tend to have 2 to 4 CPUs or cores,
and we tend to run an equal number
of BLAT jobs on each with Parasol.
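
For a single node without a batch system, the same "one job per core" idea
can be approximated as below. This is only a sketch using Python's standard
library, not how Parasol works; `run_jobs` is a hypothetical helper.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_jobs(commands, max_workers):
    """Run each command (an argument list) with at most max_workers at once.

    Returns the exit codes in the same order as the input commands.
    """
    def run_one(cmd):
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print(f"job failed ({result.returncode}): {cmd}", file=sys.stderr)
        return result.returncode

    # Threads only wait on child processes, so the work itself runs in parallel.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, commands))
```

Setting `max_workers` to `os.cpu_count()` matches the number of jobs to the
number of cores, provided each job's memory footprint fits on the node.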

Although not optimized for use on a supercomputer specifically,
Parasol is happy to run a number of jobs
on the machine. Simply supply the appropriate
config file. However, you may use any system that you like for running
your BLAT jobs.

-Galt

On 7/5/2010 12:27 PM, Assaf Gordon wrote:

Assaf Gordon

Jul 5, 2010, 5:37:10 PM
to gen...@soe.ucsc.edu
Hiram, Galt,

Thank you for the prompt and detailed response.

The configuration I have is an SGE cluster of multiple nodes, each with 4 cores and 4 GB of RAM (not ideal, but that's what's available at the moment).
We already have some scripts that split both the database and the query files into smaller chunks (for programs other than blat),
but I was hoping to avoid the database-splitting overhead for blat.

-gordon

