Solexa IT infrastructure advice


Duke

Jun 12, 2009, 3:14:15 PM
to sol...@googlegroups.com
Hi everybody,

I recently moved from the physics world to biology and am trying to learn as much as I can, so any input, advice, suggestions, opinions, or experience is welcome and greatly appreciated.

Right now I am on a project building a new DNA sequencing facility around the Solexa / Genome Analyzer from Illumina, and in preparation for the machine's arrival this September we have to figure out the most efficient and economical IT infrastructure to support it. Since we do not have much space in the lab, we try to keep everything as neat as possible, and my questions relate to the downstream analysis. There are two choices: buying a powerful computer (quad-core chips, some 128 GB of memory, etc.) or using the supercomputer and Linux cluster (lots of memory and nodes, unlimited storage, with a Gb connection) we have here in the institute. My questions are:

1. What disadvantages and inconveniences might we face if we do not have a standalone computer and use the Linux cluster instead?

2. What are the advantages of buying a computer instead of using the cluster?

Since I am new to the field, I will appreciate any advice you have.

Thank you all in advance,

D.


Andrew Gagne

Jun 12, 2009, 5:12:57 PM
to sol...@googlegroups.com
We use our school's cluster for analysis.

Perks:
- lots of CPUs
- cheaper
- redundant systems
- someone else administers the systems

Cons:
- no large-memory machines; the largest we have are 32 GB
- we are not in total control of the system, and other users' jobs may trump ours

Overall we have not had too many problems, although as we gain more instruments it may turn into a bottleneck.

Duke

Jun 12, 2009, 5:33:59 PM
to sol...@googlegroups.com
On 6/12/09 5:12 PM, Andrew Gagne wrote:
> We use our school's cluster for analysis.
>
> Perks:
> - lots of CPUs
> - cheaper
> - redundant systems
> - someone else administers the systems

Thanks, Andrew, for sharing your experience. How do you take advantage of all those CPUs if, as I have heard, the Solexa pipeline software that Illumina offers does not have a parallelized version? Or do you write your own code for the analysis?

> Cons:
> - no large-memory machines; the largest we have are 32 GB
> - we are not in total control of the system, and other users' jobs may trump ours

How about data transfer if you use a cluster? Is it a problem, considering that sequencing data is quite large (tens to hundreds of GB)?

Thanks,

D.

Andrew Gagne

Jun 12, 2009, 6:31:13 PM
to sol...@googlegroups.com
The initial image analysis (Firecrest) is parallelizable. The alignment portion is not.

Data transfer has not been an issue for us; the images are transferred during the chemistry cycles. We have a gigabit network and it has not caused any problems.

Simon

Jun 13, 2009, 6:41:30 PM
to solexa
It's worth bearing in mind that as the Illumina software develops, more of the image analysis, intensity file generation and base calling is being done on the instrument PC itself. The latest software (SCS 2.4) does this real-time analysis, a current limitation being that it is not yet able to handle control lanes.

What you need for analysis after base calling will be governed by the type of projects you are running (digital gene expression, re-sequencing, de novo assembly, etc.). A lot of our work involves de novo assemblies, for which we find our 80-core cluster adequate at present (a range of 8-32 GB memory machines, although we have just acquired a 128 GB machine to cope with some of the larger assemblies).

As for image transfer etc., we're on the verge of adopting the Illumina paradigm and deleting images immediately after analysis. We're currently using the (now redundant) IPAR unit as a staging post to keep images from the most recent run only; once we know we have good data -- bye bye images.


Clive Brown

Jun 14, 2009, 3:07:33 PM
to sol...@googlegroups.com
GA-Pipeline used to run under Sun Grid Engine and thus was parallel.



Andrew Gagne

Jun 14, 2009, 3:17:33 PM
to sol...@googlegroups.com
We are switching to the new SCS software; however, until it supports control lanes we're going to continue running Firecrest.

We use LSF as our job scheduler.

Kevin M. Carr

Jun 14, 2009, 4:07:09 PM
to Solexa User Group
There is no reason to re-run Firecrest, the image analysis step which generates the intensity and noise files. You will only need to re-run Bustard, the base caller. The control lane information is needed to properly calculate the cross-talk matrix and the phasing/prephasing numbers, which are used during the base calling step. In essence it will be like using the IPAR data.

Kevin M. Carr

**************************
Bioinformatics Specialist
Research Technology
Support Facility
S20-A Plant Biology Lab
Michigan State University
East Lansing, MI 48824

Ph: (517) 355-6759 x102
Fax:(517) 355-6758
**************************



Duke

Jun 15, 2009, 9:34:42 AM
to sol...@googlegroups.com
We also have a gigabit network here with unlimited network storage, so I think we will be OK. But I am not quite sure I understand the advantage of being able to handle control lanes. Why do we need that ability?

Thanks,

D.

Duke

Jun 15, 2009, 9:37:35 AM
to sol...@googlegroups.com
Really? So do they offer a parallel version now? I read somewhere (I cannot recall where) that they do not.

Duke

Jun 15, 2009, 9:48:18 AM
to sol...@googlegroups.com
We are also buying UPS units for our coming system. The system we have includes the GA-IIx and a Pipeline Analysis Server (four dual-core 3.4 GHz CPUs with 32 GB of memory). The Sequencing Site Preparation Guide (Rev. # April 2009) recommends the APC Smart-UPS Model SUA3000 for the GA-IIx and the APC Smart-UPS Model SUA3000RM2U for the server.

However, the Illumina techie (or salesman?) suggested we purchase the APC Smart-UPS SUA2200 for the GA-IIx "to save money". Of course we want to save money, but we don't want to buy something that can run our system for only a few minutes and then stop or break. Can anybody here with a new system (GA-IIx and Pipeline Server) give us some advice based on experience?

Thanks,

D.

Andrew Gagne

Jun 15, 2009, 10:19:07 AM
to sol...@googlegroups.com
The software has always been parallel; it's a bunch of scripts linked together via make. If you look at the documentation, there is a section (Appendix C, Using Parallelization in Pipeline) that discusses this.

Clive Brown

Jun 15, 2009, 10:25:59 AM
to sol...@googlegroups.com
That fits my recollection of earlier versions, although I am somewhat out of date now, having not looked at the software for well over a year.

c.


Kevin M. Carr

Jun 15, 2009, 10:52:25 AM
to Solexa User Group
I think two different meanings of parallel are getting conflated here. Yes, out of the box you can parallelize the pipeline on an SMP/shared-memory machine using "make -j N", where N equals the number of CPUs you wish to use.

If you want to use a cluster, you have to have SGE installed and configured with an appropriate parallel environment and run your pipeline jobs using qmake. The documentation merely mentions that this is an option for running the pipeline on a cluster; it offers no guidance on how to do it.
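For concreteness, here is a minimal sketch of both invocations. The run-folder path is illustrative only, and the parallel environment name is site-specific (your admin must have created it), so treat this as a starting point rather than a recipe:

    # SMP/shared-memory machine: fan the pipeline's make jobs
    # across 8 CPUs (the path is a placeholder, not a real layout)
    cd /data/090612_RUN/Data/Bustard_dir
    make -j 8

    # SGE cluster: qmake forwards each make job to the scheduler.
    # "-pe make 16" requests 16 slots from a parallel environment;
    # the name "make" follows the SGE manual's convention but is
    # not guaranteed at your site. Make targets, if any, go after --.
    qmake -cwd -v PATH -pe make 16 --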

Kevin M. Carr



Andrew Gagne

Jun 15, 2009, 11:05:49 AM
to sol...@googlegroups.com
Ah, sorry for causing confusion.

Our sysadmins set up our system, and we are using LSF. So it is possible to use something other than SGE. We use distmake rather than qmake.
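Not an official Illumina recipe, but a minimal sketch of the simplest LSF route (the queue name and slot count are placeholders); distmake, and the lsgmake script linked later in this thread, instead spread the make graph across multiple nodes:

    # Reserve 8 slots on a single node and let plain "make -j" use
    # them; "normal" is a placeholder queue name for your site.
    bsub -q normal -n 8 -R "span[hosts=1]" "make -j 8"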

-andrew

Duke

Jun 15, 2009, 11:39:46 AM
to sol...@googlegroups.com
On 6/15/09 10:52 AM, Kevin M. Carr wrote:
> If you want to use a cluster, you have to have SGE installed and configured
> with an appropriate parallel environment and run your pipeline jobs using
> qmake. The documentation merely mentions that this is an option for running
> the pipeline on a cluster; it offers no guidance on how to do it.

Yes, using a cluster is what I meant. If there is no documentation on that, then how would we install and configure the software?

By the way, since I am new, I am still fuzzy on some of the abbreviations. What is the LSF that Andrew mentioned? And does SGE mean Sun Grid Engine?

Thanks,

D.

Andrew Gagne

Jun 15, 2009, 11:42:33 AM
to sol...@googlegroups.com
Yes, SGE is Sun Grid Engine; LSF is another job scheduler (http://en.wikipedia.org/wiki/Platform_LSF).

If you have questions about using LSF with the pipeline software, I can put you in touch with our sysadmins.

-andrew

Clive Brown

Jun 15, 2009, 1:15:47 PM
to sol...@googlegroups.com
People have used LSF instead of SGE -- e.g. WashU -- and a script for this used to be available from David Dooling.

Use qmake instead of make; it is mostly covered by the SGE docs. Your makefile needs to be compatible, and it certainly used to be.


Tom Skelly

Jun 15, 2009, 7:03:43 PM
to sol...@googlegroups.com

A couple of points...

The future of the GA2 seems to be RTA, the Real-Time Analysis function which does image processing and (optionally) base calling on the instrument PC. That replaces IPAR (a failed experiment, apparently) and also eliminates the need for a large cluster for post-run images-to-basecalls processing.

I'm interested to hear you say you worry about RTA not using the control lane. The control lane provides four things: cross-talk correction parameters, phasing correction parameters, a base score recalibration table, and error rate estimates. Our (Sanger) experience is that you can use the lane's own data for cross-talk and phasing and never see a difference. Base score recalibration is a very thorny issue -- but that goes on in the GERALD step which, RTA or not, you do offline, post-run. Ditto error rate estimation.

FWIW, the post-run pipeline is "parallelisable" in two ways. You get "make -j N" for free if you are running on a multi-CPU computer; make will figure out how best to use the CPUs. You can also split most of the processing out on a lane-by-lane basis, so if you have a cluster of 8-CPU nodes you can run 8 lanes x 8 CPUs = 64-way parallel for much of the time. There are a few choke points which can't be split out by lane -- but most of the processing can be.
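A hedged sketch of that lane-level split, assuming an LSF-style cluster and per-lane make targets; the target names (s_1 ... s_8) are illustrative, so check your pipeline version's makefiles for the real ones:

    # One job per lane, each pinned to a single 8-CPU node; within
    # the node, "make -j 8" exploits the free SMP parallelism.
    for lane in 1 2 3 4 5 6 7 8; do
        bsub -n 8 -R "span[hosts=1]" "make -j 8 s_${lane}"
    done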

But keep in mind, if you're just starting, focus your efforts on not
needing that cluster -- at least not for primary post-run analysis.

Tom Skelly

mitch

Jun 15, 2009, 7:53:04 PM
to solexa
A patch allowing a control lane to be nominated is due for release soon (according to my FAS).

Illumina was pressed for time to release the new software.

M

Duke

Jun 16, 2009, 1:09:14 AM
to sol...@googlegroups.com
Hi all again,

To my understanding, the older version of the Illumina sequencer is the GA (or GA-II) together with IPAR (Integrated Primary Analysis and Reporting), and the Site Preparation Guide recommends the APC Smart-UPS SUA3000RM2U for it. Can you advise me what UPS you are using for your GA, and how it performs (especially over long runs of a few days)?

Thanks,

D.

Duke

Jun 16, 2009, 1:12:35 AM
to sol...@googlegroups.com
Hi again,


On 6/15/09 1:15 PM, Clive Brown wrote:
> People have used LSF instead of SGE -- e.g. WashU -- and a script for this used to be available from David Dooling.

Is there a specific reason why people use LSF instead of SGE (advantages or disadvantages)?

Thanks,

KC.

Clive Brown

Jun 16, 2009, 3:33:39 AM
to sol...@googlegroups.com
I think it's just preference and prior investment.



Andrew Gagne

Jun 16, 2009, 6:47:30 AM
to sol...@googlegroups.com
Clive is correct; we have an existing LSF cluster. I mentioned it only because it is unsupported by Illumina, but we have not had any issues with it.

hemant kelkar

Jun 16, 2009, 9:13:31 AM
to sol...@googlegroups.com
Duke,

I agree with Tom. If you are only going to have one Illumina sequencer, then you should be able to do everything you need with a single dual quad-core Xeon server (e.g. a Dell 2950, but you can obviously use a similar offering from your favorite vendor) with 16 or 32 GB of RAM.

Even if you expect to run your Illumina machine 24x7, it still takes two days to complete a 36 bp run. With human samples in all 8 lanes you should need 12 hours (max) to complete base calling and alignments on this server (if you choose not to base call on the instrument because of the control lane issue with SCS mentioned in other responses).

We have been supporting three Illumina sequencers with such a setup for the last year and a half.

Having a spare, identical server would allow you to test and switch to new versions of the pipeline seamlessly.

Hemant

David Dooling

Jun 16, 2009, 9:29:13 AM
to sol...@googlegroups.com
On Mon, Jun 15, 2009 at 06:15:47PM +0100, Clive Brown wrote:
> people have used LSF instead of SGE -- e.g. WashU. -- a script for
> this used to be available from David Dooling.

http://genome.wustl.edu/pub/software/lsgmake-gap/

--
David Dooling

David Dooling

Jun 16, 2009, 9:30:30 AM
to sol...@googlegroups.com
On Tue, Jun 16, 2009 at 01:12:35AM -0400, Duke wrote:
> On 6/15/09 1:15 PM, Clive Brown wrote:
> > people have used LSF instead of SGE -- e.g. WashU. -- a script for
> > this used to be available from David Dooling.
>
> Is there a specific reason why people use LSF instead of SGE (like
> advantages or disadvantages)?

LSF scales much better than SGE. If your cluster needs to handle tens of thousands of jobs simultaneously, SGE or PBS (and its descendants) will choke.

--
David Dooling

Duke

Jun 16, 2009, 10:29:24 AM
to sol...@googlegroups.com
Hi again,

I think I got a little confused here. The system we are buying from Illumina includes a GA-IIx with its control computer, a Cluster Station with its control computer, a Paired-End Module, and a Pipeline Analysis Server. What I thought before was that the control computer runs the GA-IIx, the image data would be transferred to the Pipeline Analysis Server, and the initial image analysis as well as the base calling would be processed on that server; downstream analysis, depending on what we want to do, would then need another powerful computer or would take advantage of the cluster.

It now seems that the image analysis and base calling will actually be handled by SCS on the instrument computer, and then the data (text files) will be sent to the server for further downstream analysis. Can anyone confirm or correct that for me?

Thanks,

D.

hemant kelkar

Jun 16, 2009, 11:05:28 AM
to sol...@googlegroups.com
Duke,

From your description below, it seems that you are buying a "pipeline" server from Illumina as part of the package, so you are going to end up with analysis-ready sequence files at the end of a run, produced either on that server or on the GA-IIx control computer.

With Pipeline 1.4 and SCS 2.4, all the analysis (up to and including base calling) can be done by the GA-IIx control computer. You can opt to do things differently at various steps in this process.

"Downstream" analysis in your case would be post-pipeline analysis. At that point you can use any kind of compute infrastructure; you can also use the "pipeline" server for this purpose (I assume it is a Linux server). It essentially depends on what kind of analysis you want to do and what code/programs you are going to use.

Hemant

Clive Brown

Jun 16, 2009, 11:09:41 AM
to sol...@googlegroups.com
It sounds like, from what Tom was saying, that it's all irrelevant anyway, as the new configurations can do the image processing on the instrument computer (finally). What's left, unless you are running zillions of GAs on big whole genomes, shouldn't be that big a deal either, for compute or for parallelisation, on a standard smallish cluster or SMP server.

FYI: the early versions of the GA-Pipeline were developed and tested on SGE with, as I recall, about 12 instruments (albeit GA-Is). (Has LSF become free yet? The cost was a major turn-off at the time.) Again, though, I'm generally very out of date on this stuff. Tom is a better source of info -- please do pester him, I know he likes it.

c.


Duke

Jun 16, 2009, 6:57:20 PM
to sol...@googlegroups.com
Hi Hemant,

Yes, you got it right. The pipeline server (four quad-core 2.93 GHz CPUs with 32 GB of memory and 9 TB of storage) comes with the analyzer as an additional option in the package we ordered from Illumina. As they told me, Pipeline 1.4 will be pre-installed on the server, and SCS 2.4 will be on the instrument control computer.

The post-pipeline analysis we are planning to do is (mostly) RNA sequencing (to discover new RNAs, for example). Can anybody give me some advice on the compute infrastructure requirements for that? Will the pipeline server or our institutional cluster be enough for the purpose?

Thanks,

D.

Benjamin Berman

Jun 16, 2009, 7:19:38 PM
to sol...@googlegroups.com

Hi Duke,

I tend to think that if you have some institutional cluster support, that might be a good way to go. I am assuming that the Illumina-bundled machine is an HP server running some flavor of Red Hat Linux. Whether you run just Illumina GA-pipeline 1.4 or (very likely) some of the many third-party tools such as BWA/SAMtools, SOAP2, Bowtie, Mosaik, Velvet, etc., most will run fine on your Linux cluster (as discussed, GA-pipeline runs fine on LSF, SGE, and probably PBS?). The one exception might be large whole-genome de novo assemblies, which might need more than 32 GB of RAM (does anyone here have direct experience with de novo assembly of transcriptomes?). The cluster tends to be a more cost-effective solution than a standalone server, plus if you have help with system administration and storage/backup, that is a big win. I worry that with the Illumina-bundled server, you might run out of space with 9 TB more quickly than you would like.

ben.

Duke

Jun 17, 2009, 12:06:43 AM
to sol...@googlegroups.com
Hi Ben,

Thanks for your input. Yes, the server we are getting is an HP ProLiant DL580 G5, shipped with Red Hat Linux Server. If using the cluster is the clear way to go, then I think we will definitely go that way. Our institutional cluster currently has one node with 96 GB of memory, and the IT department is planning upgrades to add new nodes with more memory, so hopefully memory will not be a big issue for us.

I did think about our storage capacity, but our institute has unlimited online storage, so we can store our data after post-pipeline analysis. Keeping all the images for some weeks after a run could be a problem for us, though, not because of storage limitations but because of data transfer: I am not sure our 1 Gb/s network is capable of transferring around 1 TB of image files (from just one run, for example) without any corruption or broken data.

Does anybody here keep image files on the network like that, or do you simply store them locally on your own standalone server?

Thanks,

D.

Benjamin Berman

Jun 17, 2009, 12:17:25 AM
to sol...@googlegroups.com

Boy, I would love to have "unlimited" storage capacity! Anyway, if you have a gigabit connection, you should be able to transfer images from 1 or 2 GAIIx machines without much problem (each machine will transfer about 4-6 MB/s continuously for the length of the run).
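As a back-of-envelope check (the run length and data volume are this thread's round numbers, not measurements):

    # Gigabit ethernet is ~125 MB/s raw; call it ~100 MB/s usable.
    # Two instruments at the quoted peak rate:
    echo $(( 2 * 6 ))                      # 12 MB/s, ~12% of the link
    # ~1 TB of images spread over a ~2-day run averages out to:
    echo $(( 1000000 / (2 * 24 * 3600) ))  # ~5 MB/s sustained per run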

ben.

Simon Foster

Jun 17, 2009, 3:58:45 AM
to sol...@googlegroups.com
We routinely delete images once we know we have good data from the run. We implemented this probably about a year ago and have never looked back or been in a situation where we regretted deleting images. In the absence of unlimited storage, for us the storage costs for the images would be more than the cost of running a flow cell again should we need to.

Benjamin Berman

Jun 17, 2009, 4:19:15 AM
to sol...@googlegroups.com

Yes, same with us.  The recent release of Firecrest 1.4 was the one occasion where we would have liked to have more backlogged images, but it seems unlikely that we'll see another change of that magnitude anytime soon.  I wish the GA pipeline had an option to save a nice sampling of images or image fragments (maybe 5% of total area) that could be used for troubleshooting.  We can do this ourselves, but it's the kind of thing everyone could benefit from...


ben.

Chris Dagdigian

Jun 17, 2009, 3:22:53 PM
to solexa

My $.02 concerning the SGE scaling comments by David...

(1) SGE 6.2's design goals include supporting a single array job with 500,000 tasks and hundreds of thousands of concurrent jobs (see the sketch after this list)
(2) People have been running hundreds of thousands of SGE jobs per week since the SGE 5.3 days many years ago
(3) I personally know of several sites pushing hundreds of thousands of heavy SGE jobs per week through their systems right now
(4) SGE 6.2 runs a 62,000-core cluster in Texas (RANGER) and has been doing so for some time
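A hedged illustration of what such a single array job looks like; the script name is hypothetical:

    # Submit 500,000 tasks as ONE scheduler object; each task reads
    # $SGE_TASK_ID to pick its unit of work.
    qsub -cwd -t 1-500000 process_tile.sh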

"tens of thousands of jobs" is actually pretty easy with Grid Engine
and has been for some time, scaling issues encountered in this range
have more to do with bad spooling decisions, filesystem design and
occasionally an overwhelmed qmaster host. The developers have worked
quite a bit this year to improve threading performance, reduce memory
footprints and remove things like external RSH methods that consumed
system resources like filehandles and TCP ports etc.

This is especially evident in the SGE 6.2 and 6.2u1 release series, where speed and scaling were specifically addressed as part of the design effort (6.2u3 and 6.3 will introduce new features). This is the reason why the SGE scheduler is now a thread within the qmaster -- one of the more obvious user-visible changes made recently.

There are many reasons why one would choose between LSF and SGE (I have used both for years now), but scaling is not one of the significant selection factors. Features, price, APIs, and quality of documentation are far more important, along with community adoption/support.


-Chris
(disclaimer and bias alert: employed by BioTeam and I also write the
gridengine.info blog)

Duke

Jun 17, 2009, 4:11:19 PM
to sol...@googlegroups.com
How would you know that you have good data from a run so that you can delete the images? If there is no nice sampling of images, like Ben suggested, then what can we compare our run to?

I do not really trust our CIT department's claim that we have "unlimited" storage, and I am going to run some tests to check it. But to my understanding, they have a kind of "dynamic storage" which can be automatically (or easily) expanded once we reach the capacity limit.

D.

Andrew Gagne

Jun 17, 2009, 4:18:38 PM
to sol...@googlegroups.com
I would definitely let them know the rate and quantity of data you
will be generating.

Benjamin Berman

Jun 17, 2009, 5:47:21 PM
to sol...@googlegroups.com


Just to clarify a little on the confusing control lane issue. Here is what the official user guide says (p. 17):

"For samples with biased base compositions, as encountered in many tag-based (for example, Digital Gene Expression) or microRNA applications, auto-calibration does not provide perfect results. For such samples, you need to dedicate one lane of the flow cell to a control sample and use the --control-lane command option to generate analysis parameters."

As Tom said, for many or most kinds of samples you probably don't need a control lane. But my understanding is that the cross-talk estimates use a model that is inappropriate for samples with biased base composition. In addition to those listed above (DGE, microRNA), you could also add bisulfite sequencing and probably some IP enrichment samples. I haven't seen any head-to-head comparisons of error rates for these types of samples with and without a control lane lately, but you do so at your own risk. Since Illumina is apparently working on a solution for RTA, this shouldn't be an issue long term. Sometimes I really question the strategy of releasing lots of new features before you've even fixed the old ones.

ben.

David Dooling

Jun 18, 2009, 11:09:57 AM
to sol...@googlegroups.com
On Wed, Jun 17, 2009 at 12:22:53PM -0700, Chris Dagdigian wrote:
> My $.02 concerning the SGE scaling comments by David ...

Thanks for the information. I must admit, it was several years ago that we evaluated SGE (and several other batch schedulers). At the time, LSF was the only one that could meet our needs. Given that LSF is pricey, it is good to hear SGE has improved. Presently, we are looking at Condor as a replacement for LSF.

http://www.cs.wisc.edu/condor/
