For those who haven't heard, I thought I would update you on where the
Genome Analyzer software and hardware are headed (after an informal
meeting with Illumina last week). The software pipeline will soon
enter a maintenance-only phase (no new development), and new Genome
Analyzers will come packaged with a server (called IPAR, I
believe) that will do all the work up through base-calling. The
images will still be available but will not routinely be dumped from
the IPAR machine. So, sequence will be the output of the Genome
Analyzer and the pipeline, as it currently exists, will no longer be
needed. Folks with an existing GA setup can upgrade to the new
equipment (not free, unfortunately).
This has the potential to significantly reduce IT complexity on both
the storage and the compute fronts. However, for those groups who
want/need to continue to develop
customizations to the pipeline, this seems to signal that such efforts
may become somewhat more difficult. The details of how this will work
in practice are not at all clear to me yet, though. And I admit that I
am not at all in the inner circle, so I don't have many details to
give.
Sean
P.S. I have no relationship whatsoever with Illumina. I would
encourage folks to confirm details with your local Illumina Rep before
making changes to local installations (either growing them or reducing
them).
It is a box that comes from Illumina and is apparently "about the size
of a microwave or small refrigerator". Otherwise, I do not know any
more specs. It will run during the chemistry cycles, so it seems
there will no longer be a wait.
>
> I would assume people would want to backup the images also, so I guess
> this is something to think about.
>
> Maybe transfer the sequences that were processed by IPAR so the
> pipeline can continue with the alignment outside of IPAR. But, also at
> some point, have a storage solution for the images? Are you currently
> keeping images? How long?
The storage solution for images is LARGE, no matter what, and the
amount of data will be increasing with the upgrade/new machines. We
are keeping images on a 750 GB SATA disk attached via a USB-to-SATA
connector. We simply drop the images there, disconnect the disk, and
drop it on a shelf. This isn't bomb-proof, but we don't really think
that we are going to go back to the images often. We have also toyed
with image compression, but I don't have much to say there yet--this
needs to be done in a custom way to afford any compression.
Sean
These machines have come on-line in the last 2 weeks and we are still
doing QA.
The GA is still controlled by the data collection software (SCS?) on
a Dell tower, while the IPAR system runs on an HP DL380 G5 with an
MSA70 disk array. IPAR is running on Windows.
I have been trying to get more information out of Illumina regarding
the post-processing (pipeline) procedure now. It seems they don't
really know themselves! They say IPAR implements Firecrest, and a new
version of the pipeline will be able to use these intensities rather
than the images; we would then start the pipeline from base-calling.
We are looking at getting more systems (Illumina, AB, ... it's still
open), but I totally agree managing more machines in this environment
is going to be harder.
We have a nice big cluster (running SGE) and would like to get
involved (if only as dumb users!) in the sourceforge project.
Thanks
Timothy
--
Timothy Brown
IT Architect/Senior Unix System Administrator
Ontario Institute for Cancer Research
MaRS Centre, South Tower
101 College Street, Suite 500
Toronto, Ontario, Canada M5G 0A3
Tel: 416-673-8532
Toll-free: 1-866-678-6427
www.oicr.on.ca
We at NCI would certainly be interested in such a project and
solution. We had already made some modifications to the pipeline (at
the C++ level) that we think improve the usability of the current
pipeline; the prospect of maintaining those changes on IPAR was nil.
So, operationally, what is the way forward here for those wishing to
contribute? Has a sourceforge site been established? Were you
planning on starting from the existing pipeline (don't have any idea
about licensing issues here) or starting from scratch?
Thanks,
Sean
The idea is to rewrite from scratch - we have a good understanding of it, of
course. A single binary (probably), in a platform-agnostic language, that
takes a tile or tile pattern and other parameters on the command line and
processes it end-to-end at the per-tile level. This inherently allows
you to parallelize up to your number of tiles (where tiles ~ cores). We
think it can and should take far less than 1 minute to process a full
37-cycle image stack (tile) through to calibrated base calls and write
them to a file. It should also be possible to align 20-30,000 reads from a
single tile in less than a few seconds. Parallelisation is then just a
wrapper script that forks a process with an input tile-pattern and target
runfolder - it should thus work on SGE, LSF, MOSIX or even an SMP
IPAR-like box, etc.
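The per-tile parallelisation described above can be sketched as a small driver that forks one process per independent tile. The binary name (`swift`), its command-line flags, and the runfolder layout are all assumptions for illustration, not the project's actual interface:

```python
import subprocess
from concurrent.futures import ProcessPoolExecutor

def build_command(binary, runfolder, tile):
    # One self-contained invocation per tile: the (hypothetical) binary
    # takes a tile identifier plus the target runfolder and carries that
    # tile from images through to calibrated base calls.
    return [binary, "--runfolder", runfolder, "--tile", tile]

def run_tile(args):
    binary, runfolder, tile = args
    return subprocess.run(build_command(binary, runfolder, tile)).returncode

def process_run(binary, runfolder, tiles, workers=8):
    # Tiles are independent, so parallelism is simply "one process per
    # tile, up to the core count". The same build_command() output could
    # be submitted as SGE or LSF array-job tasks instead of local forks.
    jobs = [(binary, runfolder, t) for t in tiles]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_tile, jobs))
```

Because each tile is a separate process, the same model runs unchanged on a single SMP box or, with a thin submission wrapper, on an SGE/LSF cluster.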
The only required input should be the image(s) for a tile - everything else
can be computed from that - or from once-only global lookup tables.
The first implementation will work this way - we will be mindful of enabling
a cycle-wise analysis for those who see a benefit there.
The concept is called "swift" - it is largely driven by Nava Whiteford and
Tom Skelley (in my group).
Ultimately, I would like to wrap the binary with a web service and run it on
the 'spare' cores on the instrument PC - with remote control + web GUI. At
sub 1 minute per tile a complete analysis can be done in 44 cpu hours, which
is less than the run time of even the GAII - thus the analysis can easily be
done on 1 core in near-real-time on the PC. Likewise - you can analyse the
last runfolder - on the instrument PC - whilst collecting the current one.
Post run you can complete an analysis of an entire runfolder in 1 minute
(albeit on 330 * 8 cores) - or more realistically in 40 minutes or so on a
reasonable cluster.
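The throughput figures above are mutually consistent: 330 × 8 = 2,640 tiles at one minute each is 44 cpu-hours, which is one wall-clock minute when tiles ~ cores, or about 40 minutes on a 66-core cluster (the core count is derived from the stated 40-minute figure). A quick check:

```python
TILES = 330 * 8      # tiles in a full runfolder, per the figures above
MIN_PER_TILE = 1     # claimed per-tile processing time, in minutes

cpu_minutes = TILES * MIN_PER_TILE
total_cpu_hours = cpu_minutes / 60        # 44.0 cpu-hours in total
wall_min_all_cores = cpu_minutes / TILES  # 1.0 minute when tiles ~ cores
wall_min_66_cores = cpu_minutes / 66      # 40.0 minutes on 66 cores
print(total_cpu_hours, wall_min_all_cores, wall_min_66_cores)
```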
There should be a runnable version by the end of the quarter.
C.
--
Clive G. Brown
Group Leader, (Next-Gen) Sequencing Informatics R&D
Wellcome Trust Sanger Institute
Hinxton Genome Campus
Hinxton
Cambridge
UK
CB10 1SA
+44(0) 1223 834244 Ext: 5343
OR
+44(0) 1223 495343 (Direct)
http://ical.mac.com/clive_g_brown/Clive%20G.%20Brown
http://www.linkedin.com/pub/3/533/a37
--