On 02/10/2012 09:29 PM, Ross Whetten wrote:
> I am planning to use EC2 for a course I teach at NC State Mar 2 - Apr
> 27 on analysis of deep sequencing data, and in preparation I am
> modifying a cloudbiolinux AMI (64-bit Ubuntu 11.10) to include a lot
> of NGS-related software, some from Debian-Med and some installed via
> other mechanisms.
Once you have some time to breathe again, would you feel prepared to
contribute the redistributable parts of your NGS-related software to
Debian/Ubuntu/BioLinux?
[..snip..]
> I would also appreciate any pointers to information on how to purge
> private information from an image before making it public - I have
> found some pages in the EC2 documentation, but the information seems
> somewhat disjointed, and I would like to see a more coherent
> explanation if one exists.
I am not sure I would try to remove private bits afterwards. Those should
not go up there in the first place. Otherwise, you can never be sure you
caught everything. Ok, this sounds a bit trivial.
My suggestion would be to find a way to install everything you need
in an automated fashion, which could be Debian/Ubuntu/Bio-Linux
packages or selected tarballs of what you have already done for
your current installation.
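A minimal sketch of what "automated" could look like for the packaged part: keep the package names in a plain text file and feed them to apt-get. The function name install_from_list and the DRY_RUN switch are made up for illustration, and the exact Debian package names for your tools would need checking against the repository.

```shell
#!/bin/sh
# Sketch: install every package named in a plain-text list (one per line,
# '#' starts a comment). install_from_list and DRY_RUN are hypothetical names.
install_from_list() {
    list="${1:?usage: install_from_list PACKAGE_LIST_FILE}"
    # Strip comments and blank lines, then join the names with single spaces.
    pkgs=$(sed -e 's/#.*//' -e '/^[[:space:]]*$/d' "$list" | xargs)
    [ -n "$pkgs" ] || { echo "no packages listed in $list" >&2; return 1; }
    if [ -n "${DRY_RUN:-}" ]; then
        # Only show what would be run.
        echo "apt-get install -y $pkgs"
    else
        # $pkgs is intentionally unquoted so each name is its own argument.
        apt-get install -y $pkgs
    fi
}
```

Running it first as `DRY_RUN=1 install_from_list packages.txt` shows the command without touching the system. The list file can then live in version control next to any tarballs you install by hand.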
Was that of any help?
Steffen
> I am planning to use EC2 for a course I teach at NC State Mar 2 - Apr
> 27 on analysis of deep sequencing data, and in preparation I am
> modifying a cloudbiolinux AMI (64-bit Ubuntu 11.10) to include a lot
> of NGS-related software, some from Debian-Med and some installed via
> other mechanisms. I have made a snapshot of the modified image, but
> plan to make it public after I make sure everything works as intended,
That's great. We'd definitely like to get the packages you felt were
missing into the main CloudBioLinux distribution. Would you be able to
provide a list of the software you installed?
> I found the thread on getting NX Client login to work (using the NX
> default key makes sense, but is non-intuitive), but now have a
> question about continuing to use NX. I started a new instance from the
> snapshot after terminating the original session, but the new instance
> will not let me login via NX Client, using the password I set up for
> user 'ubuntu' in the first session. Does terminating an instance reset
> the password for user ubuntu?
Terminating an instance will mess this up, since by default ssh password
login is not allowed. Here is all of the setup that needs to happen:
https://github.com/chapmanb/cloudbiolinux/blob/master/installed_files/setupnx.sh
If you can converge to a base CloudBioLinux image with your additional
software installed on a CloudMan data directory, then you can create a
shared instance of this and use that for instantiating. The benefit is
you could use BioCloudCentral (biocloudcentral.org) to launch instances
and it would take care of the NX setup.
> I would also appreciate any pointers to information on how to purge
> private information from an image before making it public - I have
> found some pages in the EC2 documentation, but the information seems
> somewhat disjointed, and I would like to see a more coherent
> explanation if one exists.
Here is the cleanup we do when preparing CloudBioLinux:
https://github.com/chapmanb/cloudbiolinux/blob/master/cloudbio/cloudman.py#L87
From a security standpoint you want to remove your ssh host keys:
rm -f /etc/ssh/ssh_host_*
but you want to be sure to do this only immediately before building the
AMI, otherwise you will lock yourself out.
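As a rough sketch of this kind of last-step cleanup (not the actual CloudBioLinux code, which is at the cloudman.py link above), something like the following could be used. Only the ssh_host_* removal comes from this message. The function name scrub_image and the extra removal of authorized_keys and shell history are assumptions for illustration. It takes the filesystem root as an argument so it can be tried on a scratch directory first.

```shell
#!/bin/sh
# Sketch: scrub private material under a filesystem root before bundling.
# scrub_image is a hypothetical helper, not part of CloudBioLinux.
scrub_image() {
    root="${1:?usage: scrub_image ROOT_DIR}"
    # SSH host keys, as in the rm command quoted in this thread.
    rm -f "$root"/etc/ssh/ssh_host_*
    # Assumption: also drop login keys and shell history for root and users.
    for home in "$root"/root "$root"/home/*; do
        [ -d "$home" ] || continue
        rm -f "$home/.ssh/authorized_keys" "$home/.bash_history"
    done
}
```

On the live instance this would be called as `scrub_image /`, and only as the very last command before creating the AMI, for the lock-out reason given above.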
Hope this helps,
Brad
On 02/10/2012 10:39 PM, Ross Whetten wrote:
>> Once you have some time to breathe again, would you feel prepared to
>> contribute the redistributable parts of your NGS-related software to
>> Debian/Ubuntu/BioLinux?
>
> Just to be clear, none of it is actually 'my' software - I am just
> installing it on the AMI from sourceforge or the authors' websites. I
> am not a computer scientist, and making Debian packages is well
> outside my expertise. I think making the revised AMI public may be the
> extent of my abilities.
That is just fine. You play a dual role:
a) defining the packages needed to complete a particular workflow
b) mediating between the cloud biolinux community and the biological problem
I am not sure which is more important. We need both. So we need you twice :)
>> My suggestion would be to find a way to install everything you need
>> in an automated fashion, which could be Debian/Ubuntu/Bio-Linux
>> packages or selected tarballs of what you have already done for
>> your current installation.
>
> The Cloudbiolinux AMIs are great, and Debian-Med is an easy way to add
> the sequence-handling tools (bedtools, samtools, picard, fastx-
> toolkit, tabix) and aligners (bowtie, bwa, last-align). The
> repositories used by the AMI (us-east1c) did not have the vcftools or
> sra-toolkit packages, and I'm not sure why there are region-specific
> repositories instead of the generic ones, so I have not tried to
> modify the repo list.
The package is fairly recent
http://packages.qa.debian.org/s/sra-sdk.html
The repository may just not yet be updated.
> The source code for most of the NGS-related programs is updated
> frequently (multiple times per year), so building an installation
> script to compile and install from source and deal with all the
> dependencies seems like a lot of effort to go through several times a
> year, and if the script is not kept up to date, it will rapidly become
> obsolete. I have never done that kind of scripting, so I don't know
> exactly what is involved, but it does not seem worthwhile at this
> point.
Once we find a couple of individuals who are prepared to craft
and maintain the Debian/Ubuntu/Bio-Linux packages for what is
missing, there are the regular tools provided by the CloudBioLinux
community to prepare the image for you. With the packages available
in the distribution, it is also fairly straightforward to update
the running instance on the fly, which comes in very handy at times
and reduces the pressure on you to steadily increase the frequency
of revisions for the image you provide ... just for the reasons
you outlined - the frequent updates.
You may rightly be shying away a bit from preparing the first
Debian/Ubuntu/Bio-Linux package for some favorite software of yours.
But since you have already managed to work your way through all
those cloud imaging instructions, you can certainly (community-)maintain
any given package when there are new versions out. Or Bayes was wrong.
I know Tim on this list to be an NGS enthusiast. And there are others.
Once you are up for it, just list what you need. And then we see
if that gap can be closed - software needs to be redistributable
and one needs a volunteer maintainer.
Cheers,
Steffen
> Thanks for the quick reply, and the github links to the code. I found I
> could delete the .nx_setup_done flagfile and re-run configure_freenx.sh to
> get NX working again, which seems the functional equivalent of the
> setupnx.sh script.
Yes, that runs exactly the same code so is exactly right.
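For anyone else hitting this on an instance started from a snapshot, the workaround amounts to the sketch below. The wrapper name redo_nx_setup is hypothetical, and where the .nx_setup_done flag file and configure_freenx.sh live depends on the image, so both paths are passed in rather than assumed.

```shell
#!/bin/sh
# Sketch: force the NX setup to run again by removing its "done" flag.
# redo_nx_setup is a hypothetical wrapper; both paths come from the caller.
redo_nx_setup() {
    flag="${1:?path to the .nx_setup_done flag file}"
    script="${2:?path to configure_freenx.sh}"
    # Presumably the setup is skipped while the flag exists, so remove it.
    rm -f "$flag"
    # Run the script directly so its own interpreter line is honoured;
    # it must be executable and is expected to need root on a real instance.
    "$script"
}
```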
> Regarding using a CloudMan directory - this seems similar to what I have
> been doing for my research EC2 computing, where I have all the software on
> my own EBS volume with my data, and just attach that to whatever size
> instance I need. I don't know if the CloudMan directory could be made
> public, so others could use it - I expect that more and more researchers
> will find EC2 a desirable alternative for analyzing NGS data.
Absolutely, CloudMan organizes the sharing process, handling the
creation of snapshots from EBS volumes and making it easier for others
to use your shared data. I have a blog post that describes using this in
more depth:
Happy to answer any specific questions if you give this a try.
> I will be
> sharing the image through IAM with students in my class, but I think others
> might find it useful as well. Of course, that is moot if Cloudbiolinux is
> updated to include the software I'm using for the class.
> I'll send a list of the programs I've installed later - I have to go
> now.
That sounds great, thank you. This is exactly the intention of
CloudBioLinux. We want to provide a community curated image so each
person doesn't have to build and make available images and we can reuse
work. The cost of adding new programs is pretty low with the build
framework. Then CloudMan adds on top of that to provide a way to share
data along with the tools.
Thanks for all the feedback,
Brad
Below is the full list of packages you mention. Two I did not add yet were
TASSEL and STACKS: could you provide urls for these? I'm not familiar
with them and couldn't dig them up with some web searching.
We can roll a new AMI that includes these so you have a base
CloudBioLinux image to use for your class and going forward. Thanks
again for the feedback and suggestions,
Brad
Need to add:
- TASSEL
- STACKS
Added:
- last-align
- Echo
- dwgsim
- Stampy
- CRISP
Included with CloudBioLinux:
- bedtools
- samtools
- picard
- fastx-toolkit
- tabix
- bowtie
- bwa
- sqlite
- Bioconductor base v 2.9
- biomaRt
- RSQLite
- Rsamtools
- ShortRead
- edgeR
- DESeq
- FastQC
- SolexaQA
- Tophat
- Cufflinks
- Bowtie2
- ABySS
> No, I just did not realize those packages were already available - I failed
> to dig deeply enough into the custom.yaml list of packages on the github
> config page. This may be a barrier for other biologists who might be
> interested in using the Cloudbiolinux AMIs as well - it is not obvious to
> those of us in the less CS-literate population how to find out what is
> installed and where it is in the file system.
Thanks for the feedback on this. I added in links to the packages from
the README. Great suggestion. The longer term solution is to have better
documentation and examples to demonstrate the installed software.
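In the meantime, a small sketch of how to answer "is it installed, and where?" from the command line on a running instance. The helper name where_is_tool is made up for illustration. It relies only on the shell's `command -v` and, where available, `dpkg -S`.

```shell
#!/bin/sh
# Sketch: report where a tool lives and, on Debian/Ubuntu, which package owns it.
# where_is_tool is a hypothetical helper name.
where_is_tool() {
    tool="${1:?usage: where_is_tool NAME}"
    path=$(command -v "$tool") || { echo "$tool: not found on PATH" >&2; return 1; }
    echo "$tool -> $path"
    # dpkg -S maps a file to its owning package. Tools built from source
    # (custom installs) have no owner, so a failed lookup is ignored.
    if command -v dpkg >/dev/null 2>&1; then
        dpkg -S "$path" 2>/dev/null || true
    fi
}
```

For example, `where_is_tool samtools` prints the path if samtools is on the PATH. For packaged software, `dpkg -l` lists everything installed and `dpkg -L PACKAGE` lists the files a package put on disk.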
> TASSEL is a java pipeline that includes functions for association
[...]
> Similarly, STACKS is software for analysis of data from a different
[...]
Thanks for these pointers. I've added custom builds for these as well so
I think we've captured everything. We'll work on rolling a new release
with this functionality included.
Great stuff,
Brad
> > We'll work on rolling a new release with this functionality included.
>
> Great! As I noted in my initial query, I start teaching the class on Mar 2,
> but the following week is spring break, so the second class meeting is Mar
> 12. The tools already in the Cloudbiolinux image will keep us occupied for
> a while, but by late March we'll be ready for some of the new software you
> have added.
I pushed a new AMI, ami-500cd139, for Ubuntu 11.10 which has all of the
new packages you suggested as well as the usual set of updates. Hope
this works for your class -- let us know how it works out.
> Mirroring the full list of software (both packaged and custom-installed)
> onto a page entitled "Bioinformatics Software installed on Cloudbiolinux
> Images" could yield Google search results more readily recognized by a
> biologist as relevant to his/her interests and understanding, particularly
> if the meta-tags are designed to attract searches by naive users.
Thanks for these great ideas. We have a hackathon coming up in July in
association with BOSC and will keep these ideas as targets to help
improve the documentation. Much appreciated,
Brad
> 1. The fastx_toolkit programs are installed in the path, but the two
> compiled programs I have tried (fastx_quality_stats and
> fastx_artifacts_filter) both fail with a buffer overflow error
> message, while the fastx_barcode_splitter.pl script works.
When I dug into this I realized there are now Debian packages for
fastx-toolkit, so I'll use those instead of the manually compiled
version. Thanks for the heads up.
> 2. SolexaQA.pl is installed and runs, but the matrix2png program that
> produces the heatmap outputs is not installed, or at least is not
> found by the perl script when it runs. The matrix2png package requires
> the php5-gd library, and installing both of those results in
> successful SolexaQA outputs with heatmaps.
The problem with matrix2png is that it hides the download behind a web
form. Practically, I've moved SolexaQA out of my pipelines since it is
slow with the large datasets generated by current Illumina machines. It
takes much longer to run than the alignment. So I never had any
motivation to reverse-engineer the matrix2png download to work in an
automated build.
I'll take this off the image, and recommend using FastQC instead:
http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/
which can handle large file sizes.
> 3. Gnu Emacs is (I am told) a fabulous editor for those who can use
> it, but the learning curve is steep for those who cannot yet use it. I
> have not yet taken the time to learn it, so I install nedit or gedit
> from the repos to have something I know how to use.
Thanks for this. I added gedit to the packages as well.
Let me know if you have any other feedback or thoughts as you dig into
it. Once everything looks good I can roll up a new AMI with the fixes.
Brad