Accessing 1000 Human Genomes data with Cloud BioLinux

agbiotec

Mar 18, 2012, 4:17:19 PM

to cloudb...@googlegroups.com

Hi guys,

I released a fork of the Cloud BioLinux AMI (ami-06845b6f) on Amazon EC2
which auto-mounts the 1000 Genomes S3 data buckets (s3sync). Here's a demo video:

http://goo.gl/n8jt2

I also added a second video that demonstrates step-by-step for end-users
how to start Cloud BioLinux and connect to it via a remote desktop interface
(I made a modification that lets users provide a password via the user data):

http://goo.gl/bYzEi

Let me know what you guys think... this is a collaboration with NCBI,
so we'll be getting some good publicity there...

The Cloud BioLinux paper is in press at BMC Bioinformatics, and once
I have the early access / provisional copy link I will send that as well.

cheers,

Ntino

Steffen Möller

Mar 18, 2012, 5:05:32 PM

to cloudb...@googlegroups.com

Hello,

On 03/18/2012 09:17 PM, agbiotec wrote:

> I released a fork of the Cloud BioLinux AMI ( ami-06845b6f) on Amazon EC2
> which auto-mounts the 1000 Genomes S3 data buckets (s3sync). Here's a demo
> video:
>

> http://goo.gl/n8jt2<https://owa.jcvi.org/OWA/redir.aspx?C=8e009090b1914f9ba0cfd2c3fdaf4e12&URL=http%3a%2f%2fgoo.gl%2fn8jt2>

>
> I also added a second video that demonstrates step-by-step for end-users
> how to start Cloud BioLinux and connect to it via remote desktop interface
> (I did a mod to allow users providing a password via the user data):
>

> http://goo.gl/bYzEi<https://owa.jcvi.org/OWA/redir.aspx?C=8e009090b1914f9ba0cfd2c3fdaf4e12&URL=http%3a%2f%2fgoo.gl%2fbYzEi>

I'll test this with my colleagues tomorrow. This is very interesting indeed. Many thanks!

> Let me know what you guys think... this is a collaboration with NCBI,
> so we'll be getting some good publicity there...

Excellent. A prominent link and use in many papers - that would be ideal.

> The Cloud BioLinux paper is in press at BMC Bioinformatics, and once
> I have the early access / provisional copy link I will send that as well.

Nice!

Steffen

Brad Chapman

Mar 19, 2012, 7:20:12 AM

to cloudb...@googlegroups.com

Ntino;

> I released a fork of the Cloud BioLinux AMI ( ami-06845b6f) on Amazon EC2
> which auto-mounts the 1000 Genomes S3 data buckets (s3sync).

This is brilliant. Thanks for putting it together. I added links to the
tutorials from the main CloudBioLinux page so they can be more widely
seen.

Would you want to add this as an option to the standard AMI? We could
attach a flag to the user-data YAML like:

data_1000genomes: true

to automatically load it, and then also add this as an option when
booting clusters from BioCloudCentral so you can avoid the AWS console
as much as possible.

> (I did a mod to allow users providing a password via the user data):

This does work now but doesn't use bare passwords. You want to do:

freenxpass: yourpassword

Avoiding bare passwords allows us to support CloudMan, and add other
options like the data_1000genomes suggestion.
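
As a rough illustration of how flags like these might be read at boot, here is a minimal Python sketch. It assumes the user data consists only of flat "key: value" lines; the parse_user_data helper is hypothetical and is not the parser CloudBioLinux or CloudMan actually uses, and a real setup would more likely rely on a full YAML library.

```python
def parse_user_data(text):
    """Parse flat 'key: value' lines (a small subset of YAML) into a dict."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue  # skip blank lines and anything that is not key: value
        key, value = line.split(":", 1)
        key, value = key.strip(), value.strip()
        if value.lower() in ("true", "false"):
            options[key] = value.lower() == "true"  # booleans such as the data flag
        else:
            options[key] = value  # everything else stays a string
    return options


# Example user data combining the two options mentioned above.
user_data = """\
freenxpass: yourpassword
data_1000genomes: true
"""

options = parse_user_data(user_data)
# A missing flag defaults to False, so existing user data keeps working.
mount_1000genomes = options.get("data_1000genomes", False)
print(options)
```

Defaulting the flag to False means instances started without it behave as before.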

Thanks again for this,
Brad

agbiotec

Mar 19, 2012, 11:15:01 AM

to cloudb...@googlegroups.com

Sorry guys, I keep getting this "owa.jcvi.org" thingy added to my links. Here are the right ones for the videos:

http://youtu.be/A8JLh44L1Cw

http://youtu.be/2a1D2QL0u9Y

Brad,

I'll add a small Python script to the CloudBioLinux fabric framework that mounts the 1000 Genomes buckets via s3fs / FUSE and creates a shortcut on the desktop. I'll also set it up to be activated at boot time via the "data_1000genomes: true" flag in the user data.
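
A sketch of what such a script could look like, for illustration only and not the script used on the AMI. It assumes the s3fs-fuse command line with its public_bucket=1 option for anonymous access, a bucket named "1000genomes", and a mount point of /mnt/1000genomes; the function names are hypothetical. By default it only prints what it would do.

```python
import os
import subprocess


def build_mount_command(bucket, mount_point):
    """Return an s3fs command line for a read-only, anonymous mount."""
    return [
        "s3fs", bucket, mount_point,
        "-o", "public_bucket=1",  # s3fs option: anonymous access to a public bucket
        "-o", "ro",               # standard FUSE option: mount read-only
    ]


def mount_and_link(bucket, mount_point, desktop_dir, dry_run=True):
    """Mount the bucket and symlink it on the desktop; only print when dry_run."""
    command = build_mount_command(bucket, mount_point)
    shortcut = os.path.join(desktop_dir, bucket)
    if dry_run:
        print(" ".join(command))
        print("ln -s %s %s" % (mount_point, shortcut))
        return command
    os.makedirs(mount_point, exist_ok=True)
    subprocess.check_call(command)  # raises CalledProcessError if the mount fails
    if not os.path.islink(shortcut):
        os.symlink(mount_point, shortcut)
    return command


# Bucket name and paths are assumptions made for this example.
command = mount_and_link("1000genomes", "/mnt/1000genomes",
                         os.path.expanduser("~/Desktop"))
```

Calling mount_and_link with dry_run=False would perform the mount, which requires s3fs to be installed and sufficient privileges on the instance.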

Should we then deploy that feature along with all the other stuff in the next fab rollout of the VM? Would that be a couple of months down the road? I guess the BioCloudCentral webapp will need some extension to include additional user data when it boots the AMIs, or is that feature already there and I am missing it?

cheers,

Ntino

agbiotec

Mar 19, 2012, 11:29:26 AM

to cloudb...@googlegroups.com

And for those of you who can't wait to read the paper, here it is!

http://www.biomedcentral.com/imedia/1873666514589126_article.pdf

James Taylor

Mar 19, 2012, 11:37:27 AM

to cloudb...@googlegroups.com

On Mar 19, 2012, at 7:20 AM, Brad Chapman wrote:

> Would you want to add this as an option to the standard AMI? We could
> attach a flag to the user-data YAML like:
>
> data_1000genomes: true
>
> to automatically load it, and then also add this as an option when
> booting clusters from BioCloudCentral so you can avoid the AWS console
> as much as possible.

It would be great to have this be even more general. You could have a registry of different datasets that can be mounted (in a YAML file in a bucket somewhere), and on startup provide a list of identifiers for things to mount. BioCloudCentral could then generate a GUI from the registry.
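
To make the registry idea concrete, here is a minimal Python sketch. Everything in it is hypothetical: the registry layout, the field names, the "example_snapshot" entry and the resolve_datasets helper are invented for illustration, and the registry is written as JSON rather than YAML only so the example runs with the standard library.

```python
import json

# Hypothetical registry content; in the proposal this would live in a bucket.
REGISTRY_JSON = """
{
  "1000genomes": {
    "type": "s3",
    "source": "1000genomes",
    "mount_point": "/mnt/1000genomes"
  },
  "example_snapshot": {
    "type": "ebs_snapshot",
    "source": "snap-PLACEHOLDER",
    "mount_point": "/mnt/example"
  }
}
"""


def resolve_datasets(registry, requested):
    """Split requested identifiers into registry entries to mount and unknown ids."""
    to_mount, unknown = [], []
    for identifier in requested:
        entry = registry.get(identifier)
        if entry is None:
            unknown.append(identifier)
        else:
            to_mount.append(dict(entry, id=identifier))  # copy entry, add its id
    return to_mount, unknown


registry = json.loads(REGISTRY_JSON)
# Identifiers as they might arrive in the user data at startup.
to_mount, unknown = resolve_datasets(registry, ["1000genomes", "no_such_dataset"])
# The same registry keys could drive a list of choices in a GUI.
choices = sorted(registry)
```

Reporting unknown identifiers separately, rather than failing, would let an instance boot even if the registry and the user data are out of sync.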

-- jt

James Taylor, Assistant Professor, Biology / Computer Science, Emory University

agbiotec

Mar 19, 2012, 11:45:10 AM

to cloudb...@googlegroups.com

James,

That's in fact a very good idea!

Enis Afgan

Mar 19, 2012, 7:19:14 PM

to cloudb...@googlegroups.com

This could be supported within CloudMan in a fairly straightforward manner: CloudMan already has provisions for dealing with these read-only snapshots, so it would be a matter of adding some code to extend these new, custom file systems down to NFS. Also, as things work now, anything that's added would automatically persist for future invocations, so it would not require repeated user action.

The remaining 'bigger' thing to do would be adding a library of the available snapshots to choose from, but that has to be done anyhow.

--
You received this message because you are subscribed to the Google Groups "cloudbiolinux" group.
To view this discussion on the web visit https://groups.google.com/d/msg/cloudbiolinux/-/HhgZSC_fEoEJ.

To post to this group, send email to cloudb...@googlegroups.com.
To unsubscribe from this group, send email to cloudbiolinu...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/cloudbiolinux?hl=en.

Brad Chapman

Mar 19, 2012, 9:02:11 PM

to Enis Afgan, cloudb...@googlegroups.com

Enis, Ntino, James;
Y'all are awesome. Generalizing this mounting of data directories would
be the perfect way to move forward.

From the BioCloudCentral side, we can add a set of checkboxes with
available data sources. Making this dynamic from some set of
configuration is a great goal. Short term we can put together a process
to support the 1000 genomes data and then move forward from there.

Enis, including this in CloudMan is a great idea. Thanks for offering to
help with this. Ntino, do you have the script available you use for
automated mounting of the filesystem and setting up the desktop?

From the CloudBioLinux side, we can roll a new AMI anytime if we need
changes there to help support it.

Thanks again all,
Brad

Tim Booth

Mar 21, 2012, 10:04:22 AM

to cloudb...@googlegroups.com

Hi All,

Definitely a good idea. There are two related projects I know of that
are aiming to help locate and download reference datasets onto
standalone Linux instances:

biomaj.genouest.org/
wiki.debian.org/getData

In getData, a set of configuration snippets identify reference datasets
of interest, and when activated the scripts will pull down the dataset
and keep it updated (via cron jobs) as appropriate. Biomaj is similar
but already has a graphical user interface to select the datasets you
want to track.

Both of these are works in progress, but my idea was that they should
work "intelligently" in a cloud environment. For example, the user
should simply tell getData (via a configuration dialog) that he wants
the 1000genomes dataset plus the latest swissprot in blastable format.
On a regular machine the backend will go and download the data via FTP,
but in a cloud context it will mount S3 buckets etc. This would help
any user moving jobs on and off EC2 or other cloud services.
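
The backend switch described here could be sketched in Python roughly as follows. This is an illustration only, not how getData or Biomaj work: the plan_fetch function and the dataset fields are hypothetical, and the FTP URL is a placeholder.

```python
def plan_fetch(dataset, in_cloud):
    """Pick how to obtain a dataset: mount from S3 in the cloud, else download via FTP."""
    if in_cloud and dataset.get("s3_bucket"):
        return ("mount_s3", dataset["s3_bucket"])
    # Fall back to FTP on a regular machine, or when no bucket is known.
    return ("download_ftp", dataset["ftp_url"])


# Hypothetical dataset description; the FTP URL is a placeholder.
DATASET = {
    "name": "1000genomes",
    "s3_bucket": "1000genomes",
    "ftp_url": "ftp://ftp.example.org/1000genomes/",
}

cloud_plan = plan_fetch(DATASET, in_cloud=True)
local_plan = plan_fetch(DATASET, in_cloud=False)
print(cloud_plan, local_plan)
```

The user-facing request stays the same in both cases; only the returned action differs.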

We discussed various ideas at the Debian Med sprint meeting in January.
The natural conclusion of the discussion is to try and do for open
reference data what the Debian package repository does for free software
- a kind of "digital NAR databases" registry that various tools could
reference. But I've not thought about the idea since then, and maybe in
this context it's just too much of an abstract idea to be worrying
about. I guess in some ways it comes down to whether we want to
distinguish CBL as having special access to these types of datasets as a
USP of the system, or whether we want to get involved with other,
broader, attempts to manage reference data on non-cloud platforms.

Cheers,

TIM

--
Tim Booth <tbo...@ceh.ac.uk>
NERC Environmental Bioinformatics Centre

Centre for Ecology and Hydrology
Maclean Bldg, Benson Lane
Crowmarsh Gifford
Wallingford, England
OX10 8BB

http://nebc.nerc.ac.uk
+44 1491 69 2705

Brad Chapman

Mar 22, 2012, 8:06:34 PM

to Tim Booth, cloudb...@googlegroups.com

Tim;
Thanks for the links to getData and Biomaj. I'm open to including
whatever solution is most practical. The one tricky thing with a ton of
these datasets is that they are huge and users will only want some
specific portion. For instance, with 1000 genomes users will probably
want to extract a couple of individuals or the VCF files. Similarly with
genome data users will want specific organisms.

Having an S3 filesystem mount is a quick and dirty way to get there
without having to build targets for each individual subset and that's
similar to what I've tried to do with data_fabfile for genomes.

Just some other brainstorming thoughts. Looking forward to seeing more
data associated with the CloudBioLinux image,
Brad

Reza Safarnejad

Feb 21, 2013, 2:01:56 PM

to cloudb...@googlegroups.com

Hi Ntino. I work at NCBI and I am having trouble connecting to the image. Are the username and password still Ubuntu and testpass?

Thanks,

Reza Safarnejad
