Can someone recommend a consultant to help us integrate our high-performance computing and storage infrastructure in support of Illumina NGS?


Gustavo Kijak

Jul 26, 2010, 6:23:19 PM
to solexa
A little bit of background:

We are just getting started with the Illumina GA IIx. We had no
preexisting IT infrastructure on site to support this work, so we had
to build it from scratch. Along with the single instrument we bought
the small server sold by Illumina (8 TB or so), but it soon proved to
be too small for our needs (whole human genome sequencing at 25-30x
coverage). We bought a SAN (42 TB) and we have recently bought a
128-node cluster.
Software-wise, we would use the Illumina pipeline, and CASAVA just for
QC and to tell us whether we have enough coverage. The real alignment
and SNP calling will be done with BWA and SAMtools.
The intensive storage/computing needs imposed by NGS are a bit outside
the expertise of our IT department, and they are having a hard time
getting all of the above-mentioned components to work together
properly.
Any help we can get from you will be highly appreciated!

TerryC

Jul 27, 2010, 9:43:49 AM
to solexa
Gustavo,

Our institution has been managing Illumina sequencers from an
infrastructure perspective for just under two years now, so I
understand the challenges you are facing. The bioinformatics and IT
issues, at least from my perspective, can be broken down into a few
categories: pipeline management & computation, storage & retrieval,
bioinformatics, and general IT/networking issues.

The first question is, "Are you a core facility (i.e. are you
processing samples for other labs)?" or "a large research lab (i.e. all
of the samples belong only to your group)?". If you are in the first
category (i.e. a core facility), then you should consider how your
clients will obtain their results, and ONLY their results, in a way
that is not overly burdensome to your group - such as burning to CDs,
creating unique ftp shares each time, etc. Additionally, the results
should be cataloged so that it is easy to track them by client, sample,
lane, flowcell, etc. - basically whatever metadata you decide to apply
to your flowcell runs. We have one client that has already processed
over 500 samples, and it can be a real chore for the lab and its
collaborators to track down the results for a specific sample six
months later. If you are not a core facility, storage and retrieval of
data can be simplified greatly: just store it all on a network drive
with sufficient disk space.
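
To give a rough idea of what I mean by cataloging, here is a minimal
sketch in Python using SQLite; the table and column names are purely
illustrative, not our actual schema:

import sqlite3

conn = sqlite3.connect("runs_catalog.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS results (
        flowcell    TEXT,
        lane        INTEGER,
        sample      TEXT,
        client      TEXT,
        run_date    TEXT,
        result_path TEXT
    )
""")

def register_result(flowcell, lane, sample, client, run_date, result_path):
    # record where the results for one lane/sample ended up
    conn.execute("INSERT INTO results VALUES (?, ?, ?, ?, ?, ?)",
                 (flowcell, lane, sample, client, run_date, result_path))
    conn.commit()

def find_results(client=None, sample=None):
    # look up result paths months later by client and/or sample
    query = "SELECT flowcell, lane, sample, result_path FROM results WHERE 1=1"
    args = []
    if client:
        query += " AND client = ?"
        args.append(client)
    if sample:
        query += " AND sample = ?"
        args.append(sample)
    return conn.execute(query, args).fetchall()

The point is just that every result file gets a row of metadata
attached to it, so "find everything for client X, sample Y" becomes a
query instead of a hunt through directories.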

You will also need policies on which results you store, and for how
long. Initially lots of researchers wanted to keep their images, but
now we pretty much just give out the eland_extended and sorted files.
Soon (with the new Gerald release) we will probably only offer BAM/SAM
formats, with scripts to convert to the previous formats for those
groups that have developed custom bioinformatics scripts built around,
say, sorted or eland_extended files.
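
As an illustration of what such a conversion script might look like,
here is a stripped-down example that dumps the mapped reads in a BAM
back out as a simple tab-delimited file using pysam. This is not the
exact eland_extended/sorted layout, just the general idea, and it
assumes pysam is installed:

import pysam

def bam_to_tab(bam_path, out_path):
    # one mapped read per line: name, sequence, chrom, 1-based position,
    # strand, mapping quality
    bam = pysam.Samfile(bam_path, "rb")
    out = open(out_path, "w")
    for read in bam:
        if read.is_unmapped:
            continue
        strand = "R" if read.is_reverse else "F"
        out.write("\t".join([read.qname,
                             read.seq or "",
                             bam.getrname(read.tid),
                             str(read.pos + 1),
                             strand,
                             str(read.mapq)]) + "\n")
    out.close()
    bam.close()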

Our group solved this problem for our nucleic acid sequencing
facility. We built a web-based software tool to configure flow cells,
which captures metadata about the sample and experiment and the details
of the Illumina run (i.e. single/paired end, genome, assembly version,
RNA, adapter, etc.). From this information we build a series of
configuration files which we pass on to another application (a series
of Python and PBS scripts) that transfers the raw data to our
supercomputing facility (any old cluster would do here) and automates
our pipeline runs remotely. The great thing about building this around
our supercomputing facility is that scaling up as we add more
sequencers will not be a problem as far as meeting future computational
needs go (storage could be a problem), since the facility has thousands
of nodes available. We initially looked at Amazon Web Services, and EC2
specifically, but the transfer costs would have been a killer for every
run. Our supercomputing facility is free, so it was a no-brainer.
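
To make that a little more concrete, the transfer-and-submit step boils
down to something like the snippet below. The host name, paths and
script names here are made up, and our real scripts do a lot more
bookkeeping and error checking:

import subprocess

CLUSTER = "user@hpc.example.edu"        # hypothetical login node
REMOTE_BASE = "/scratch/seqcore/runs"   # hypothetical scratch area

def push_run_and_submit(run_dir, run_id):
    # copy one run folder to the cluster and queue the pipeline job for it
    remote_dir = "%s/%s" % (REMOTE_BASE, run_id)

    # 1. mirror the raw data to the cluster's scratch space
    subprocess.check_call(["rsync", "-a", "--partial",
                           run_dir + "/", "%s:%s/" % (CLUSTER, remote_dir)])

    # 2. submit the PBS job that drives the pipeline for this run
    subprocess.check_call(["ssh", CLUSTER,
                           "qsub", "-v", "RUN_DIR=%s" % remote_dir,
                           "%s/run_pipeline.pbs" % REMOTE_BASE])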

Our Python server performs the following:

1.) transfers data to and from our core facility and the supercomputing
facility
2.) manages the Gerald make process
3.) runs some custom pipeline steps for keeping statistics and building
.bed and .wig files for integration with the UCSC genome browser and
IGV tools
4.) reports status updates to the web-based tool
5.) partitions data based on lane, sample, and client information (a
sketch of this step follows the list)
6.) sends data, based on client details, either to an sftp site for
download or to a web-based application where results are searchable by
the metadata tags that were initially captured, and downloadable
7.) makes backups of the results to a network storage device (a 20 TB
drive) that we keep for about 6-9 months.
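
As promised above, here is roughly what the partitioning step (5) looks
like. The sample-sheet columns and the s_<lane>_* file naming are only
meant as an illustration:

import csv, glob, os, shutil

def partition_results(results_dir, samplesheet_csv, out_base):
    # copy each lane's result files into a per-client / per-sample folder
    with open(samplesheet_csv) as fh:
        for row in csv.DictReader(fh):   # expects columns: client, sample, lane
            dest = os.path.join(out_base, row["client"], row["sample"])
            if not os.path.isdir(dest):
                os.makedirs(dest)
            # grab this lane's output files (Gerald names them s_<lane>_*)
            for path in glob.glob(os.path.join(results_dir,
                                               "s_%s_*" % row["lane"])):
                shutil.copy(path, dest)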

We also built this infrastructure so that it would be easy to add
other sequencing technologies (SOLiD, single molecule, whatever): all
we have to do is replace the Python scripts specific to Gerald with a
different set.

If you would like more information or have any questions, let me know.
We are more than happy to share our experiences with anyone else that
is struggling with these sorts of problems.

Regards,

Terry C

Davide Cittaro

Jul 27, 2010, 10:52:27 AM
to sol...@googlegroups.com
Hi Gustavo,

On Jul 27, 2010, at 12:23 AM, Gustavo Kijak wrote:

> A little bit of background:
>
> We are just getting started with the Illumina GA IIx. We had no
> preexisting IT infrastructure on site to support this work, so we had
> to build it from scratch. Along with the single instrument we bought
> the small server sold by Illumina (8 TB or so), but it soon proved to
> be too small for our needs (whole human genome sequencing at 25-30x
> coverage). We bought a SAN (42 TB) and we have recently bought a
> 128-node cluster.

Whoa, that's a pretty huge cluster (128 nodes? how many cores per node?)! We have a GA II and mostly process ChIP-seq / RNA-seq samples (much less SNP calling). Most of our analysis doesn't require that much computational power, so we run all alignments and post-processing on a 16-core / 32 GB server. For additional and longer analyses there is a small cluster (3 nodes, 24 cores and 64 GB RAM per node) which also serves other computational tasks.
We process 1-3 Illumina runs per month (and if you have one instrument you can only run about once a week). We just started delivering results on a Galaxy server (http://usegalaxy.org), which has 8 cores, 16 GB RAM and 3 TB of disk space (managed with ZFS + the Nexenta OS). We currently deliver fastq (actually srf files), BAM and all post-processed files (bigWig, tables...).

> Software-wise, we would use the Illumina pipeline, and CASAVA just
> for QC and to tell us whether we have enough coverage. The real
> alignment and SNP calling will be done with BWA and SAMtools.

We don't use Illumina software anymore (except for basecalling and reports). We perform alignments with bwa, and we make extensive use of samtools, bedtools and custom software.
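
For what it's worth, the core of our alignment step is little more than
a wrapper around bwa and samtools, something like the sketch below. The
file names are placeholders and the exact flags depend on the bwa and
samtools versions you have installed:

import subprocess

def run(cmd, stdout=None):
    # run one external command, failing loudly on a non-zero exit
    print(" ".join(cmd))
    subprocess.check_call(cmd, stdout=stdout)

def align_single_end(ref_fa, reads_fq, prefix):
    # bwa aln + samse: the short-read workflow for GA IIx-length reads
    with open(prefix + ".sai", "w") as sai:
        run(["bwa", "aln", ref_fa, reads_fq], stdout=sai)
    with open(prefix + ".sam", "w") as sam:
        run(["bwa", "samse", ref_fa, prefix + ".sai", reads_fq], stdout=sam)

    # SAM -> sorted, indexed BAM with samtools
    with open(prefix + ".bam", "w") as bam:
        run(["samtools", "view", "-bS", prefix + ".sam"], stdout=bam)
    run(["samtools", "sort", "-o", prefix + ".sorted.bam", prefix + ".bam"])
    run(["samtools", "index", prefix + ".sorted.bam"])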

d

> The intensive storage/computing needs imposed by NGS are a bit
> outside the expertise of our IT department, and they are having a
> hard time getting all of the above-mentioned components to work
> together properly.
> Any help we can get from you will be highly appreciated!
>


---
Davide Cittaro
daweo...@gmail.com
http://sites.google.com/site/davidecittaro/
