NGS-focused AMI


Ross Whetten

Feb 10, 2012, 3:29:22 PM
to cloudbiolinux
I am planning to use EC2 for a course I teach at NC State Mar 2 - Apr
27 on analysis of deep sequencing data, and in preparation I am
modifying a cloudbiolinux AMI (64-bit Ubuntu 11.10) to include a lot
of NGS-related software, some from Debian-Med and some installed via
other mechanisms. I have made a snapshot of the modified image, but
plan to make it public after I make sure everything works as intended,
and after I am sure I have purged all my private information from the
snapshot.
I found the thread on getting NX Client login to work (using the NX
default key makes sense, but is non-intuitive), but now have a
question about continuing to use NX. I started a new instance from the
snapshot after terminating the original session, but the new instance
will not let me login via NX Client, using the password I set up for
user 'ubuntu' in the first session. Does terminating an instance reset
the password for user ubuntu?
I would also appreciate any pointers to information on how to purge
private information from an image before making it public - I have
found some pages in the EC2 documentation, but the information seems
somewhat disjointed, and I would like to see a more coherent
explanation if one exists.

Steffen Möller

Feb 10, 2012, 3:49:27 PM
to cloudb...@googlegroups.com
Hello,

On 02/10/2012 09:29 PM, Ross Whetten wrote:
> I am planning to use EC2 for a course I teach at NC State Mar 2 - Apr
> 27 on analysis of deep sequencing data, and in preparation I am
> modifying a cloudbiolinux AMI (64-bit Ubuntu 11.10) to include a lot
> of NGS-related software, some from Debian-Med and some installed via
> other mechanisms.

Once you have some time to breathe again, would you feel prepared to
contribute the redistributable parts of your NGS-related software to
Debian/Ubuntu/BioLinux?

[..snip..]


> I would also appreciate any pointers to information on how to purge
> private information from an image before making it public - I have
> found some pages in the EC2 documentation, but the information seems
> somewhat disjointed, and I would like to see a more coherent
> explanation if one exists.

I am not sure I would remove private bits. Those should not go up
there in the first place. Otherwise, you may never know. Ok, this
sounds a bit trivial.

My suggestion would be to find a way to install everything you need
in an automated fashion, which could be Debian/Ubuntu/Bio-Linux
packages or selected tarballs of what you have already done for
your current installation.

Was that of any help?

Steffen

Brad Chapman

Feb 10, 2012, 4:21:51 PM
to Ross Whetten, cloudbiolinux

Ross;

> I am planning to use EC2 for a course I teach at NC State Mar 2 - Apr
> 27 on analysis of deep sequencing data, and in preparation I am
> modifying a cloudbiolinux AMI (64-bit Ubuntu 11.10) to include a lot
> of NGS-related software, some from Debian-Med and some installed via
> other mechanisms. I have made a snapshot of the modified image, but
> plan to make it public after I make sure everything works as intended,

That's great. We'd definitely like to get the packages you felt were
missing into the main CloudBioLinux distribution. Would you be able to
provide a list of the software you installed?

> I found the thread on getting NX Client login to work (using the NX
> default key makes sense, but is non-intuitive), but now have a
> question about continuing to use NX. I started a new instance from the
> snapshot after terminating the original session, but the new instance
> will not let me login via NX Client, using the password I set up for
> user 'ubuntu' in the first session. Does terminating an instance reset
> the password for user ubuntu?

Terminating an instance will mess this up, since by default ssh password
login is not allowed. Here is all of the setup that needs to happen:

https://github.com/chapmanb/cloudbiolinux/blob/master/installed_files/setupnx.sh

If you can converge to a base CloudBioLinux image with your additional
software installed on a CloudMan data directory, then you can create a
shared instance of this and use that for instantiating. The benefit is that
you could use BioCloudCentral (biocloudcentral.org) to launch instances
and it would take care of the NX setup.

> I would also appreciate any pointers to information on how to purge
> private information from an image before making it public - I have
> found some pages in the EC2 documentation, but the information seems
> somewhat disjointed, and I would like to see a more coherent
> explanation if one exists.

Here is the cleanup we do when preparing CloudBioLinux:

https://github.com/chapmanb/cloudbiolinux/blob/master/cloudbio/cloudman.py#L87

From a security standpoint you want to remove your ssh keys:

rm -f /etc/ssh/ssh_host_*

but you want to be sure to do this only immediately before building the
AMI, otherwise you will lock yourself out.
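
A minimal sketch of that final step, for anyone who would rather preview the deletion before running it. It is not the CloudBioLinux cleanup code linked above: the function names and the `root` and `dry_run` parameters are assumptions made for illustration, and it covers only the files matched by the `rm` command, not other private data such as shell history or credentials.

```python
import glob
import os


def find_host_keys(root="/"):
    """Return the files matching <root>/etc/ssh/ssh_host_*, sorted by name."""
    pattern = os.path.join(root, "etc", "ssh", "ssh_host_*")
    return sorted(glob.glob(pattern))


def remove_host_keys(root="/", dry_run=True):
    """Delete the matched key files; with dry_run=True only report them."""
    targets = find_host_keys(root)
    for path in targets:
        if dry_run:
            print("would remove", path)
        else:
            os.remove(path)
    return targets
```

Calling `remove_host_keys()` with the defaults only prints the candidates. Passing `dry_run=False` performs the deletion and, as noted above, belongs at the very end of the session, immediately before the AMI is built.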

Hope this helps,
Brad

Ross Whetten

Feb 10, 2012, 4:39:53 PM
to cloudbiolinux
Hi Steffen,
Thanks for the quick reply, and your suggestions.
You wrote
> Once you have some time to breathe again, would you feel prepared to
> contribute the redistributable parts of your NGS-related software to
> Debian/Ubuntu/BioLinux?

Just to be clear, none of it is actually 'my' software - I am just
installing it on the AMI from sourceforge or the authors' websites. I
am not a computer scientist, and making Debian packages is well
outside my expertise. I think making the revised AMI public may be the
extent of my abilities.

> My suggestion would be to find a way to install everything you need
> in an automated fashion, which could be Debian/Ubuntu/Bio-Linux
> packages or selected tarballs of what you have already done for
> your current installation.

The Cloudbiolinux AMIs are great, and Debian-Med is an easy way to add
the sequence-handling tools (bedtools, samtools, picard, fastx-
toolkit, tabix) and aligners (bowtie, bwa, last-align). The
repositories used by the AMI (us-east1c) did not have the vcftools or
sra-toolkit packages, and I'm not sure why there are region-specific
repositories instead of the generic ones, so I have not tried to
modify the repo list.
The source code for most of the NGS-related programs is updated
frequently (multiple times per year), so building an installation
script to compile and install from source and deal with all the
dependencies seems like a lot of effort to go through several times a
year, and if the script is not kept up to date, it will rapidly become
obsolete. I have never done that kind of scripting, so I don't know
exactly what is involved, but it does not seem worthwhile at this
point.
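
For the Debian-Med part of the list above, the automated install Steffen suggests can be fairly small. The sketch below is only an illustration of the idea, not the CloudBioLinux build framework: the package names are copied as written in this message and may not match the exact Debian/Ubuntu package names, and the function names and the `dry_run` parameter are assumptions. It does not address the compile-from-source programs, which are the harder part.

```python
import subprocess

# Package names as written in this thread; the exact Debian/Ubuntu
# package names may differ (assumption), so check before a real run.
DEBIAN_MED_PACKAGES = [
    "bedtools", "samtools", "picard", "fastx-toolkit",
    "tabix", "bowtie", "bwa", "last-align",
]


def build_install_command(packages):
    """Build one non-interactive apt-get call for a list of packages."""
    return ["apt-get", "install", "-y"] + sorted(set(packages))


def install(packages, dry_run=True):
    """Print the command, and run it only when dry_run is False."""
    command = build_install_command(packages)
    print(" ".join(command))
    if not dry_run:
        # Needs root privileges on a real system.
        subprocess.check_call(command)
    return command
```

Keeping the package list as plain data means a new tool is a one-line change, which is the main reason a scripted install tends to age better than a hand-modified image.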

Regards,
Ross

Steffen Möller

Feb 10, 2012, 5:39:54 PM
to cloudb...@googlegroups.com
Hi Ross,

On 02/10/2012 10:39 PM, Ross Whetten wrote:
>> Once you have some time to breathe again, would you feel prepared to
>> contribute the redistributable parts of your NGS-related software to
>> Debian/Ubuntu/BioLinux?
>
> Just to be clear, none of it is actually 'my' software - I am just
> installing it on the AMI from sourceforge or the authors' websites. I
> am not a computer scientist, and making Debian packages is well
> outside my expertise. I think making the revised AMI public may be the
> extent of my abilities.

That is just fine. You play a dual role:
a) defining packages to complete a particular workflow
b) mediating between the CloudBioLinux community and the biological problem
and I am not sure which is more important. We need both. So we need you twice :)

>> My suggestion would be to find a way to install everything you need
>> in an automated fashion, which could be Debian/Ubuntu/Bio-Linux
>> packages or selected tarballs of what you have already done for
>> your current installation.
>
> The Cloudbiolinux AMIs are great, and Debian-Med is an easy way to add
> the sequence-handling tools (bedtools, samtools, picard, fastx-
> toolkit, tabix) and aligners (bowtie, bwa, last-align). The
> repositories used by the AMI (us-east1c) did not have the vcftools or
> sra-toolkit packages, and I'm not sure why there are region-specific
> repositories instead of the generic ones, so I have not tried to
> modify the repo list.

The package is fairly recent
http://packages.qa.debian.org/s/sra-sdk.html
The repository may just not yet be updated.

> The source code for most of the NGS-related programs is updated
> frequently (multiple times per year), so building an installation
> script to compile and install from source and deal with all the
> dependencies seems like a lot of effort to go through several times a
> year, and if the script is not kept up to date, it will rapidly become
> obsolete. I have never done that kind of scripting, so I don't know
> exactly what is involved, but it does not seem worthwhile at this
> point.

Once we find a couple of individuals who are prepared to craft
and maintain the Debian/Ubuntu/Bio-Linux packages for what is
missing, there are the regular tools provided by the CloudBioLinux
community to prepare the image for you. With the packages available
in the distribution, it is also fairly straightforward to update
the running instance on the fly, which comes in very handy at times
and reduces the pressure on you to steadily increase the frequency
of revisions for the image you provide ... just for the reasons
you outlined - the frequent updates.

You may rightly be shying away a bit from preparing the first
Debian/Ubuntu/Bio-Linux package for some favorite software of yours.
But since you have already managed to work your way through all
those cloud imaging instructions, you can certainly (community-)maintain
any given package when there are new versions out. Or Bayes was wrong.

I know Tim on this list to be an NGS enthusiast. And there are others.
Once you are up for it, just list what you need. And then we see
if that gap can be closed - software needs to be redistributable
and one needs a volunteer maintainer.

Cheers,

Steffen


Brad Chapman

Feb 12, 2012, 2:08:35 PM
to Ross Whetten, cloudbiolinux

Ross;

> Thanks for the quick reply, and the github links to the code. I found I
> could delete the .nx_setup_done flagfile and re-run configure_freenx.sh to
> get NX working again, which seems the functional equivalent of the
> setupnx.sh script.

Yes, that runs exactly the same code so is exactly right.
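
The reason deleting the flag file works is the usual run-once pattern: setup is skipped while a marker file exists and runs again once the marker is gone. The sketch below illustrates that general pattern only. It is an assumption about how such a flag is typically used, not a transcription of configure_freenx.sh, and `run_once` and its parameters are hypothetical names.

```python
import os


def run_once(flag_path, setup):
    """Run setup() unless flag_path exists, then create the flag file.

    Returns True if setup ran, False if it was skipped.
    """
    if os.path.exists(flag_path):
        return False
    setup()
    # Create the empty marker only after setup finished without raising.
    with open(flag_path, "w"):
        pass
    return True
```

With this shape, removing the marker (here, something like `.nx_setup_done`) is all it takes to make the next call run the setup again.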

> Regarding using a CloudMan directory - this seems similar to what I have
> been doing for my research EC2 computing, where I have all the software on
> my own EBS volume with my data, and just attach that to whatever size
> instance I need. I don't know if the CloudMan directory could be made
> public, so others could use it - I expect that more and more researchers
> will find EC2 a desirable alternative for analyzing NGS data.

Absolutely, CloudMan organizes the sharing process, handling the
creation of snapshots from EBS volumes and making it easier for others
to use your shared data. I have a blog post that describes using this in
more depth:

http://j.mp/uNXZY6

Happy to answer any specific questions if you give this a try.

> I will be
> sharing the image through IAM with students in my class, but I think others
> might find it useful as well. Of course, that is moot if Cloudbiolinux is
> updated to include the software I'm using for the class.
> I'll send a list of the programs I've installed later - I have to go
> now.

That sounds great, thank you. This is exactly the intention of
CloudBioLinux. We want to provide a community curated image so each
person doesn't have to build and make available images and we can reuse
work. The cost of adding new programs is pretty low with the build
framework. Then CloudMan adds on top of that to provide a way to share
data along with the tools.

Thanks for all the feedback,
Brad

Ross Whetten

Feb 12, 2012, 9:39:06 PM
to cloudbiolinux
Brad,
The software tools I have installed in the AMI for my sequence
analysis class, in addition to the Debian-Med packages I mentioned in
my response to Steffen, include sqlite, the base v 2.9 Bioconductor
with biomaRt, RSQLite, Rsamtools, ShortRead, edgeR, and DESeq
packages; FastQC, SolexaQA, and Echo (quality assurance/error
correction programs); dwgsim (a sequence-read simulation program);
Stampy (read mapping); Tophat, Cufflinks, and Bowtie2 (read mapping
and RNA-Seq analysis); the ABySS assembler; and CRISP, TASSEL, and
STACKS (programs for sequence variant discovery, the latter two
developed for use with specific library preparation procedures).

These programs do not represent any sort of selected elite chosen as
the best-of-breed in their respective areas of application, and their
inclusion or the exclusion of other programs with comparable functions
does not represent a value judgment regarding the suitability of any
program for a particular purpose. Instead, I have chosen these as a
sample of programs to use as examples in teaching biological sciences
graduate students enough about command-line Linux and EC2 to equip
them to learn what they need to know to analyze their own datasets.

Regards,
Ross

Brad Chapman

Feb 15, 2012, 8:25:03 AM
to Ross Whetten, cloudbiolinux

Ross;
Thanks for the details on the packages. Most of the ones you've listed
are already included as part of CloudBioLinux, and I worked on adding in
the additional suggestions. Did you find a need to reinstall things like
Bioconductor libraries and bowtie?

Below is the full list of packages you mention. Two I did not add yet were
TASSEL and STACKS: could you provide urls for these? I'm not familiar
with them and couldn't dig them up with some web searching.

We can roll a new AMI that includes these so you have a base
CloudBioLinux image to use for your class and going forward. Thanks
again for the feedback and suggestions,
Brad

Need to add:
- TASSEL
- STACKS

Added:
- last-align
- Echo
- dwgsim
- Stampy
- CRISP

Included with CloudBioLinux:
- bedtools
- samtools
- picard
- fastx-toolkit
- tabix
- bowtie
- bwa
- sqlite
- base v 2.9 Bioconductor
- biomaRt
- RSQLite
- Rsamtools
- ShortRead
- edgeR
- DESeq
- FastQC
- SolexaQA
- Tophat
- Cufflinks
- Bowtie2
- ABySS

Ross Whetten

Feb 15, 2012, 9:56:10 AM
to Brad Chapman, cloudbiolinux
Brad,
No, I just did not realize those packages were already available - I failed to dig deeply enough into the custom.yaml list of packages on the github config page. This may be a barrier for other biologists who might be interested in using the Cloudbiolinux AMIs as well - it is not obvious to those of us in the less CS-literate population how to find out what is installed and where it is in the file system. I recognize the value of having the package lists as part of the install scripts on github, because the information can be updated with new package information and is immediately available to those who know where to look.
One suggestion is to link the names of the configuration files listed in the README.md section of the config page to the files themselves. There are already links in the section above the README, but depending on browser window size it is not always apparent that those links are present, so the README can appear to be a dead-end.
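
For the narrower question of what is installed and where it lives in the file system, one option on a running instance is to ask the shell's search path directly. This is a minimal sketch using Python's standard library. The executable names are assumptions based on the tools named in this thread, some packages install their programs under different names, and tools that are not on the PATH (Java jars, R packages) will not be found this way.

```python
import shutil

# Executable names are assumptions based on the tools named in this thread;
# some packages install their programs under different names.
TOOLS = ["samtools", "bedtools", "bowtie", "bwa", "tabix"]


def locate_tools(names):
    """Map each name to its full path on PATH, or None if it is not found."""
    return {name: shutil.which(name) for name in names}


def report(names):
    """Print one line per tool showing where it lives in the file system."""
    found = locate_tools(names)
    for name in names:
        print(f"{name:12s} {found[name] or 'not found on PATH'}")
    return found
```

Running `report(TOOLS)` prints one line per tool, with either its full path or a note that it was not found.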

TASSEL is a java pipeline that includes functions for association testing using a mixed linear model that incorporates both kinship and population structure (with a GUI), and for analysis of genotyping-by-sequencing (GBS) data obtained by the method of Elshire et al (PLoS One 6:e19379, 2011), which to my knowledge is strictly command-line. The executables are available at
http://www.maizegenetics.net/tassel/tassel3.0_standalone.zip, and documentation is at http://www.maizegenetics.net/tassel/docs/TasselPipelineGBS.pdf. To my knowledge, the GBS component of the TASSEL software has not yet been described in a journal publication.

Similarly, STACKS is software for analysis of data from a different method of genotyping by sequencing, called Restriction-site Associated DNA sequencing, or RAD-Seq (Baird et al, PLoS One 3:e3376). The software was described by Catchen et al (Genes|Genomes|Genetics 1:171, 2011), and is available for download at http://creskolab.uoregon.edu/stacks/.

Regards,
Ross

Brad Chapman

Feb 16, 2012, 6:29:46 AM
to Ross Whetten, cloudbiolinux

Ross;

> No, I just did not realize those packages were already available - I failed
> to dig deeply enough into the custom.yaml list of packages on the github
> config page. This may be a barrier for other biologists who might be
> interested in using the Cloudbiolinux AMIs as well - it is not obvious to
> those of us in the less CS-literate population how to find out what is
> installed and where it is in the file system.

Thanks for the feedback on this. I added in links to the packages from
the README. Great suggestion. The longer term solution is to have better
documentation and examples to demonstrate the installed software.

> TASSEL is a java pipeline that includes functions for association

[...]


> Similarly, STACKS is software for analysis of data from a different

[...]

Thanks for these pointers. I've added custom builds for these as well so
I think we've captured everything. We'll work on rolling a new release
with this functionality included.

Great stuff,
Brad

Ross Whetten

Feb 16, 2012, 5:08:47 PM
to Brad Chapman, cloudbiolinux
Brad,

> I added in links to the packages from the README
Thanks - that was quick!


> The longer term solution is to have better documentation
> and examples to demonstrate the installed software.

Here's another suggestion to help those of us for whom Google is the primary source of documentation...

Would it be possible to mirror the lists from the yaml files on the Github config page to a page under the cloudbiolinux.org domain, with a page title that sounds encouraging and familiar to non-CS types?

I found that a Google search for the keywords <cloudbiolinux samtools> brings up the custom.yaml page near the top of the list of search hits, but the page title is "config/custom.yaml at master from chapmanb/cloudbiolinux - GitHub". This is not a title that sounds warm and welcoming to most biologists, and many might not follow up on that result.

Mirroring the full list of software (both packaged and custom-installed) onto a page entitled "Bioinformatics Software installed on Cloudbiolinux Images" could yield Google search results more readily recognized by a biologist as relevant to his/her interests and understanding, particularly if the meta-tags are designed to attract searches by naive users. If the names of software tools on that page could in turn be linked directly to the sourceforge page or other homepage of the software developer, someone who found the "Bioinformatics Software" page could get the documentation on all the tools available, including those they might not have been previously aware of, without anyone writing new documentation. 
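
A rough sketch of what generating such a page could look like, assuming the tool names and homepages have already been collected into a simple mapping. How they would be extracted from the yaml files is not shown, and `render_page` is a hypothetical helper. The two entries use URLs quoted elsewhere in this thread.

```python
import html

# Title taken from the suggestion above; URLs are ones quoted in this thread.
PAGE_TITLE = "Bioinformatics Software installed on Cloudbiolinux Images"
TOOLS = {
    "FastQC": "http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/",
    "STACKS": "http://creskolab.uoregon.edu/stacks/",
}


def render_page(title, tools):
    """Return a minimal HTML page listing each tool as a link to its homepage."""
    items = "\n".join(
        f'    <li><a href="{html.escape(url, quote=True)}">{html.escape(name)}</a></li>'
        for name, url in sorted(tools.items())
    )
    return (
        "<html>\n"
        f"  <head><title>{html.escape(title)}</title></head>\n"
        "  <body>\n"
        f"    <h1>{html.escape(title)}</h1>\n"
        "    <ul>\n"
        f"{items}\n"
        "    </ul>\n"
        "  </body>\n"
        "</html>\n"
    )
```

Because the page is rendered from data, it could be regenerated whenever the package lists change, so the mirror would not drift out of date the way hand-written documentation does.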

Another link on the same page could point to the SeqAnswers dynamic list of NGS software at http://seqanswers.com/wiki/Software/list, so that people who are interested in keeping up with the full range of what is available know where to look.

Another site I discovered recently, and that I think many other biologists would find useful, is http://software-carpentry.org. This might be suitable as a link on the main cloudbiolinux.org page, perhaps at the bottom under "Documentation" or a new heading "Introductory Material".


> We'll work on rolling a new release with this functionality included.

Great! As I noted in my initial query, I start teaching the class on Mar 2, but the following week is spring break, so the second class meeting is Mar 12. The tools already in the Cloudbiolinux image will keep us occupied for a while, but by late March we'll be ready for some of the new software you have added.

Regards,
Ross

Brad Chapman

Mar 5, 2012, 2:55:56 PM
to Ross Whetten, cloudbiolinux

Ross;

> > We'll work on rolling a new release with this functionality included.
>
> Great! As I noted in my initial query, I start teaching the class on Mar 2,
> but the following week is spring break, so the second class meeting is Mar
> 12. The tools already in the Cloudbiolinux image will keep us occupied for
> a while, but by late March we'll be ready for some of the new software you
> have added.

I pushed a new AMI, ami-500cd139, for Ubuntu 11.10 which has all of the
new packages you suggested as well as the usual set of updates. Hope
this works for your class -- let us know how it works out.

> Mirroring the full list of software (both packaged and custom-installed)
> onto a page entitled "Bioinformatics Software installed on Cloudbiolinux
> Images" could yield Google search results more readily recognized by a
> biologist as relevant to his/her interests and understanding, particularly
> if the meta-tags are designed to attract searches by naive users.

Thanks for these great ideas. We have a hackathon coming up in July in
association with BOSC and will keep these ideas as targets to help
improve the documentation. Much appreciated,
Brad

Ross Whetten

Mar 5, 2012, 4:26:48 PM
to Brad Chapman, cloudbiolinux
Brad,
Thank you - I really appreciate the follow-through on this issue. I'll have a chance to work with the new AMI this week, and I'll let you know if I have any additional ideas.
Regards,
Ross

Ross Whetten

Mar 12, 2012, 10:55:13 PM
to cloudbiolinux
Hi Brad,
I have had a chance to play with the new AMI, and have a few
comments.
1. The fastx_toolkit programs are installed in the path, but the two
compiled programs I have tried (fastx_quality_stats and
fastx_artifacts_filter) both fail with a buffer overflow error
message, while the fastx_barcode_splitter.pl script works. Installing
the v 0.0.13 binaries from the Hannon lab download page works fine, so
the problem seems to be specific to the programs rolled into the AMI.

2. SolexaQA.pl is installed and runs, but the matrix2png program that
produces the heatmap outputs is not installed, or at least is not
found by the perl script when it runs. The matrix2png package requires
the php5-gd library, and installing both of those results in
successful SolexaQA outputs with heatmaps.

3. Gnu Emacs is (I am told) a fabulous editor for those who can use
it, but the learning curve is steep for those who cannot yet use it. I
have not yet taken the time to learn it, so I install nedit or gedit
from the repos to have something I know how to use.

Thanks for all your work on the AMI!

Regards,
Ross

Brad Chapman

Mar 13, 2012, 6:07:29 AM
to Ross Whetten, cloudbiolinux

Ross;
Thanks much for the feedback.

> 1. The fastx_toolkit programs are installed in the path, but the two
> compiled programs I have tried (fastx_quality_stats and
> fastx_artifacts_filter) both fail with a buffer overflow error
> message, while the fastx_barcode_splitter.pl script works.

When I dug into this I realized there are now Debian packages for
fastx-toolkit, so I'll use those instead of the manually compiled
version. Thanks for the heads up.

> 2. SolexaQA.pl is installed and runs, but the matrix2png program that
> produces the heatmap outputs is not installed, or at least is not
> found by the perl script when it runs. The matrix2png package requires
> the php5-gd library, and installing both of those results in
> successful SolexaQA outputs with heatmaps.

The problem with matrix2png is that it hides the download behind a web
form. Practically, I've moved SolexaQA out of my pipelines since it is
slow with the large datasets generated by current Illumina machines. It
takes much longer to run than the alignment. So I never had any
motivation to reverse-engineer the matrix2png download to work in an
automated build.

I'll take this off the image, and recommend using FastQC instead:

http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/

which can handle large file sizes.

> 3. Gnu Emacs is (I am told) a fabulous editor for those who can use
> it, but the learning curve is steep for those who cannot yet use it. I
> have not yet taken the time to learn it, so I install nedit or gedit
> from the repos to have something I know how to use.

Thanks for this. I added gedit to the packages as well.

Let me know if you have any other feedback or thoughts as you dig into
it. Once everything looks good I can roll up a new AMI with the fixes.

Brad
