System requirements

jai

Sep 5, 2013, 11:57:15 PM
to gotc...@googlegroups.com
Dear all,
I am new to this group.
Can anyone tell me the system requirements for carrying out the analysis?

Thanks
jai

Mary Kate Wing

Sep 6, 2013, 10:26:45 AM
to jai, gotcloud
It has been tested on Ubuntu 10.04 (lucid), 12.04.2 LTS, and 12.10, so we recommend one of those versions or newer.
It has also been run on a recent version of CentOS (Red Hat).  You can try other Linux systems and they should work, but they have not yet been tested.  If you are using a different version, let me know if you run into any problems, and we should be able to get it going.
On some versions you may need to build from source, as not all are compatible with the binary distribution.

GotCloud requires Java, GNU make, g++, and libssl (you can check for these by running gotcloud/scripts/check_requirements.sh).
It also requires Perl, and zlib for both C++ and Perl (Zlib.pm).
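
As a rough complement to check_requirements.sh (not a replacement for it), the command-line tools named above can be looked up on PATH with a few lines of Python. This is a minimal sketch: the command names in the list are assumptions about how the tools are installed on your system, and libraries such as libssl, zlib, and Zlib.pm are not commands, so they are not covered.

```python
import shutil

# Assumed command names for the tools listed above; adjust for your system.
# Libraries (libssl, zlib, Zlib.pm) cannot be detected this way.
REQUIRED_COMMANDS = ["java", "make", "g++", "perl"]


def find_missing(commands=REQUIRED_COMMANDS):
    """Return the commands that cannot be found on PATH."""
    return [cmd for cmd in commands if shutil.which(cmd) is None]


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("All listed commands were found on PATH.")
```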

Let me know if you have any additional questions or notice anything else.

Mary Kate Wing



Casey Vu

Nov 6, 2013, 2:38:13 PM
to gotc...@googlegroups.com, jai
First of all, thanks for the info; it is useful. I have another question about running GotCloud on a cluster.
I don't have root access to the machines in this cluster, so installing things like slurm or mosix would be quite a hassle. Could you explain how GotCloud is supposed to run in a distributed manner, so that I can maybe find a workaround?

Thanks a lot,
Casey

Mary Kate Wing

Nov 6, 2013, 2:56:41 PM
to Casey Vu, gotcloud, jai
It depends on the type of scheduler used on your cluster.

If the cluster uses:
   Sun Grid Engine and supports qrsh, set in your configuration:
       BATCH_TYPE = sgei
   slurm and supports srun, set in your configuration:
       BATCH_TYPE = slurmi
   mosix and supports mosbatch, set in your configuration:
       BATCH_TYPE = mosix

Set BATCH_OPTS to any additional options you need to specify.

You should see the proper cluster commands in your Makefile produced by gotcloud.

If you are running a different system that supports an interactive mode, we can help you update gotcloud/bin/Multi.pm to support this.

We are currently working to update gotcloud to support non-interactive cluster commands.

For now, if your system is not supported by GotCloud and you are running the aligner, you can use GotCloud to generate the Makefiles for each sample, but not run them.
You can then manually send each Makefile out to your cluster, running each sample separately.
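
A minimal sketch of that manual step, assuming you already have the list of per-sample Makefile paths: it only builds one `make -f` command line per Makefile. The `submit_prefix` argument is a hypothetical placeholder for whatever submission command your cluster uses; nothing here is part of GotCloud, and the example paths are made up.

```python
import shlex


def build_commands(makefiles, submit_prefix=None, jobs=1):
    """Return one shell command line per per-sample Makefile.

    submit_prefix is a hypothetical list of words for your scheduler's
    submission command; leave it as None to get plain make commands.
    jobs is passed to make's -j option.
    """
    commands = []
    for path in makefiles:
        words = list(submit_prefix or []) + ["make", "-f", path, "-j", str(jobs)]
        # Quote each word so paths with spaces survive a shell.
        commands.append(" ".join(shlex.quote(word) for word in words))
    return commands


if __name__ == "__main__":
    # Hypothetical Makefile paths, for illustration only.
    for line in build_commands(["out/sample1.Makefile", "out/sample2.Makefile"], jobs=2):
        print(line)
```

Each resulting line can then be handed to the cluster by whatever means it supports, one sample per job.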

If you provide more information on how you send jobs to your cluster, I will be able to provide better information and help get you going.

Mary Kate

Casey Vu

Nov 6, 2013, 5:56:38 PM
to gotc...@googlegroups.com, Casey Vu, jai
Thanks for your reply. I'll try to explain my situation:

Project purpose:
We want to compare the performance of different aligners, specifically distributed/parallel implementations against non-parallel ones (either a completely different aligner, or the same one run without distribution), to see how practical it is to run sequence alignment in a distributed manner.
We are mainly interested in the "align" step.

Cluster:
The resource manager available (to the best of my knowledge so far) is YARN (with HDFS, of course).

Questions:
1. What I understand now is: I can create jobs (one for each sample) and send them to YARN resource manager. However, I'm still unclear how I'm going to generate the Makefiles without running them. I guess there are some parameters for that in the conf file.

2. Is it possible to parallelize within one sample? Or even within one FASTQ file?

Thank you very much.
Casey 

Mary Kate Wing

Nov 7, 2013, 9:21:50 AM
to Casey Vu, gotcloud, jai
I have not used YARN, but we can take a look at it.

1)  Run gotcloud with the "--dryrun" command-line option to create the makefiles without running them.

2) Unfortunately, in the current implementation, there is not a very convenient way to run a single sample across a cluster.
You can parallelize within a sample by running the per-sample Makefile with a -j option.  There is 1 target per FASTQ, so those can all be run in parallel on a single machine - if the machine is large enough to support it.
You can run parts of BWA multi-threaded by setting in your configuration file: BWA_THREADS = -t 1
Change the 1 to the number of threads you want to run in parallel on that machine.

You could also manually split your fastqs into smaller files prior to running and just specify them all in the index file of fastqs.
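
A minimal sketch of such a split, assuming uncompressed FASTQ input with exactly four lines per record. The function name and the output naming scheme are made up for illustration. For paired-end data, both mate files would need to be split with the same `records_per_chunk` so that the pairs stay in step.

```python
import itertools


def split_fastq(path, records_per_chunk, prefix):
    """Split a FASTQ file into chunks and return the chunk file names.

    Assumes uncompressed input with exactly four lines per record.
    Each chunk is held in memory while it is written.
    """
    lines_per_chunk = 4 * records_per_chunk
    chunk_names = []
    with open(path) as src:
        for index in itertools.count(1):
            # Read the next block of lines; an empty block means end of file.
            chunk = list(itertools.islice(src, lines_per_chunk))
            if not chunk:
                break
            name = f"{prefix}.part{index:03d}.fastq"
            with open(name, "w") as dst:
                dst.writelines(chunk)
            chunk_names.append(name)
    return chunk_names
```

The returned file names are the ones that would then be listed in the index file of fastqs.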

It is technically possible to run a single sample across a cluster, but it does require a bit of manual work and some "simple" scripting.  The Makefile contains targets and dependencies.  So as long as the directories used by gotcloud are accessible from all machines in your cluster, you could parse the Makefile and send cluster commands for each Makefile target, properly setting the dependencies on those commands so they run in the correct order.
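
The parsing part of that idea can be sketched as follows. This is a simplified illustration, not GotCloud code: it only understands plain `target: prerequisites` lines and ignores variables, pattern rules, line continuations, and includes, so a real per-sample Makefile may need more handling. It uses `graphlib`, which requires Python 3.9 or newer. Turning the resulting order into scheduler commands with dependencies is cluster-specific and is not shown.

```python
import re
from graphlib import TopologicalSorter  # Python 3.9+

# A rule header: one or more targets, a colon (not ":="), then prerequisites.
RULE = re.compile(r"^([^\s:#=][^:=]*):(?!=)\s*(.*)$")
# Special targets such as .PHONY or .DELETE_ON_ERROR are not real jobs.
SPECIAL = re.compile(r"\.[A-Z_]+")


def parse_rules(makefile_text):
    """Map each target to its list of prerequisites (simplified parser)."""
    deps = {}
    for line in makefile_text.splitlines():
        if line.startswith("\t"):  # recipe line, not a rule header
            continue
        match = RULE.match(line)
        if not match:
            continue
        prereqs = match.group(2).split()
        for target in match.group(1).split():
            if SPECIAL.fullmatch(target):
                continue
            deps.setdefault(target, []).extend(prereqs)
    return deps


def submission_order(deps):
    """Return targets so that every prerequisite precedes its dependents."""
    return list(TopologicalSorter(deps).static_order())
```

Names that appear only as prerequisites (for example input FASTQ files) also show up in the order; they are existing files rather than jobs and would be skipped when submitting.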

We've been looking at that for our snpcalling pipeline, but the same idea should work for the alignment pipeline.

Note: The fastq steps for a given sample can be run in parallel for as many fastqs as there are.  After the BAMs for each fastq/pair have been created, the pipeline works on just a single BAM per sample.  Those later steps currently run as a single thread, with no way to run them in parallel.

Mary Kate



Casey Vu

Nov 7, 2013, 10:53:54 AM
to gotc...@googlegroups.com, Casey Vu, jai
Thanks a lot; that gives me a general idea of how things work.
If you can take a look at YARN, that would be great.

I have one last question (for now): what is the level of parallelization when GotCloud runs with the schedulers that you listed (one job or multiple jobs per sample)?

Lastly, thanks again for your reply. It has been very helpful.

Casey

Mary Kate Wing

Nov 7, 2013, 11:03:15 AM
to Casey Vu, gotcloud, jai
For the aligner, with slurm, mosix, and Sun Grid Engine, a single sample in GotCloud will run on a single node in the cluster.  It will not spread out to multiple nodes, but different samples can be sent to different nodes.  The number of samples to run in parallel at a time is set by the "--numcs" parameter (number of concurrent samples).

While a single sample is confined to a single node in the cluster, it can run multiple jobs on that one node.  This is specified as --numjobs (number of jobs per sample).

Does that answer your question?