For those who haven't heard, I thought I would update you on where the
Genome Analyzer software and hardware are headed (after an informal
meeting with Illumina last week). The software pipeline will soon
enter a maintenance-only phase (no new development), and new Genome
Analyzers will come packaged with a server (called IPAR, I
believe) that will do all the work up through base-calling. The
images will still be available but will not routinely be dumped from
the IPAR machine. So, sequence will be the output of the Genome
Analyzer and the pipeline, as it currently exists, will no longer be
needed. Folks with an existing GA setup can upgrade to the new
equipment (not free, unfortunately).
This has the potential to significantly reduce IT complexity on both
the storage and the compute fronts. However, for those groups who
want/need to continue to develop
customizations to the pipeline, this seems to signal that such efforts
may become somewhat more difficult. The details of how this will work
in practice are not at all clear to me yet, though. And I admit that I
am not at all in the inner circle, so I don't have many details to
give.
Sean
P.S. I have no relationship whatsoever with Illumina. I would
encourage folks to confirm details with your local Illumina Rep before
making changes to local installations (either growing them or reducing
them).
It is a box that comes from Illumina and is apparently "about the size
of a microwave or small refrigerator". Otherwise, I do not know any
more specs. It will run during the chemistry cycles, so it seems
there will no longer be a wait.
>
> I would assume people would want to backup the images also, so I guess
> this is something to think about.
>
> Maybe transfer the sequences that were processed by IPAR so the
> pipeline can continue with the alignment outside of IPAR. But, also at
> some point, have a storage solution for the images? Are you currently
> keeping images? How long?
The storage solution for images is LARGE, no matter what, and the
amount of data will be increasing with the upgrade/new machines. We
are keeping images on a 750 GB SATA disk attached via a USB-to-SATA
connector. We simply drop the images there, disconnect the disk, and
drop it on a shelf. This isn't bomb-proof, but we don't really think
that we are going to go back to the images often. We have also toyed
with image compression, but I don't have much to say there yet--this
needs to be done in a custom way to afford any compression.
Sean
These machines have come on-line in the last 2 weeks and we are still
doing QA.
The GA is still controlled by the data collection software (SCS?) on
a Dell tower, while the IPAR system runs on an HP DL380 G5 with an
MSA70 disk array. IPAR is running on Windows.
I have been trying to get more information out of Illumina regarding
the post-processing (pipeline) procedure now. It seems they don't
really know themselves! They say IPAR implements Firecrest, and a new
version of the pipeline will be able to use these intensities rather
than the images; we would then start the pipeline from base-calling.
We are looking at getting more systems (Illumina, AB, ... it's still
open), but I totally agree managing more machines in this environment
is going to be harder.
We have a nice big cluster (running SGE) and would like to get
involved (if only as dumb users!) in the sourceforge project.
Thanks
Timothy
--
Timothy Brown
IT Architect/Senior Unix System Administrator
Ontario Institute for Cancer Research
MaRS Centre, South Tower
101 College Street, Suite 500
Toronto, Ontario, Canada M5G 0A3
Tel: 416-673-8532
Toll-free: 1-866-678-6427
www.oicr.on.ca
We at NCI would certainly be interested in such a project and
solution. We had already made some modifications to the pipeline (at
the C++ level) that we think improve the usability of the current
pipeline; the prospect of maintaining those changes on IPAR was nil.
So, operationally, what is the way forward here for those wishing to
contribute? Has a sourceforge site been established? Were you
planning on starting from the existing pipeline (don't have any idea
about licensing issues here) or starting from scratch?
Thanks,
Sean
The idea is to rewrite from scratch - we have a good understanding of it, of
course. A single binary (probably), in a platform-agnostic language, that
takes a tile or tile pattern and other parameters on the command line and
processes it end-to-end at the per-tile level. This inherently allows
you to parallelize up to your number of tiles (where tiles ~ cores). We
think it can and should take far less than 1 minute to process a full
37-cycle image stack (tile) through to calibrated base calls and write
them to a file. It should also be possible to align 20-30,000 reads from a
single tile in less than a few seconds. Parallelisation is then just a
wrapper script that forks a process with an input tile-pattern and target
runfolder - it should thus work on SGE, LSF, MOSIX or even an SMP
IPAR-like box, etc.
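The per-tile parallelisation described above can be sketched as a small driver that forks one process per independent tile. The binary name (`swift`), its command-line flags, and the runfolder layout are all assumptions for illustration, not the project's actual interface:

```python
import subprocess
from concurrent.futures import ProcessPoolExecutor

def build_command(binary, runfolder, tile):
    # One self-contained invocation per tile: the (hypothetical) binary
    # takes a tile identifier plus the target runfolder and carries that
    # tile from images through to calibrated base calls.
    return [binary, "--runfolder", runfolder, "--tile", tile]

def run_tile(args):
    binary, runfolder, tile = args
    return subprocess.run(build_command(binary, runfolder, tile)).returncode

def process_run(binary, runfolder, tiles, workers=8):
    # Tiles are independent, so parallelism is simply "one process per
    # tile, up to the core count". The same build_command() output could
    # be submitted as SGE or LSF array-job tasks instead of local forks.
    jobs = [(binary, runfolder, t) for t in tiles]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_tile, jobs))
```

Because each tile is a separate process, the same model runs unchanged on a single SMP box or, with a thin submission wrapper, on an SGE/LSF cluster.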
The only required input should be the image(s) for a tile - everything else
can be computed from that - or from once-only global lookup tables.
The first implementation will work this way - we will be mindful of enabling
a cycle-wise analysis for those who see a benefit there.
The concept is called "swift" - it is largely driven by Nava Whiteford and
Tom Skelley (in my group).
Ultimately, I would like to wrap the binary with a web service and run it on
the 'spare' cores on the instrument PC - with remote control + web GUI. At
sub 1 minute per tile a complete analysis can be done in 44 cpu hours, which
is less than the run time of even the GAII - thus the analysis can easily be
done on 1 core in near-real-time on the PC. Likewise - you can analyse the
last runfolder - on the instrument PC - whilst collecting the current one.
Post run you can complete an analysis of an entire runfolder in 1 minute
(albeit on 330 * 8 cores) - or more realistically in 40 minutes or so on a
reasonable cluster.
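The throughput figures above are mutually consistent: 330 × 8 = 2,640 tiles at one minute each is 44 cpu-hours, which is one wall-clock minute when tiles ~ cores, or about 40 minutes on a 66-core cluster (the core count is derived from the stated 40-minute figure). A quick check:

```python
TILES = 330 * 8      # tiles in a full runfolder, per the figures above
MIN_PER_TILE = 1     # claimed per-tile processing time, in minutes

cpu_minutes = TILES * MIN_PER_TILE
total_cpu_hours = cpu_minutes / 60        # 44.0 cpu-hours in total
wall_min_all_cores = cpu_minutes / TILES  # 1.0 minute when tiles ~ cores
wall_min_66_cores = cpu_minutes / 66      # 40.0 minutes on 66 cores
print(total_cpu_hours, wall_min_all_cores, wall_min_66_cores)
```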
There should be a runnable version by the end of the quarter.
C.
--
Clive G. Brown
Group Leader, (Next-Gen) Sequencing Informatics R&D
Wellcome Trust Sanger Institute
Hinxton Genome Campus
Hinxton
Cambridge
UK
CB10 1SA
+44(0) 1223 834244 Ext: 5343
OR
+44(0) 1223 495343 (Direct)
http://ical.mac.com/clive_g_brown/Clive%20G.%20Brown
http://www.linkedin.com/pub/3/533/a37
--