Reproducible research & Docker


Carl Boettiger

Aug 7, 2014, 5:39:08 PM8/7/14
to ropensci...@googlegroups.com, Rich FitzJohn, Matt Pennell, Yihui Xie, aez...@gmail.com, w.cor...@unsw.edu.au, ti...@idyll.org

Hi rOpenSci list + friends [^1],

Yay, the ropensci-discuss list is revived!  

Some of you might recall the discussion about reproducible research in the comments of Rich et al.'s recent post on the rOpenSci blog, where quite a few people mentioned Docker's potential as a way to facilitate this.

I’ve only just started playing around with Docker, and though I’m quite impressed, I’m still rather skeptical that non-crazies would ever use it productively. Nevertheless, I’ve worked up some Dockerfiles to explore how one might use this approach to transparently document and manage a computational environment, and I was hoping to get some feedback from all of you.

For those of you who are already much more familiar with Docker than me (or are looking for an excuse to explore!), I’d love to get your feedback on some of the particulars. For everyone, I’d be curious what you think about the general concept.

So far I’ve created a Dockerfile and a corresponding image.

If you have docker up and running, perhaps you can give it a test drive:

docker run -it cboettig/ropensci-docker /bin/bash

You should find R installed with some common packages. This image builds on Dirk Eddelbuettel’s R docker images and serves as a starting point to test individual R packages or projects.

For instance, my RNeXML manuscript draft is a bit more of a bear than usual to run, since it needs rJava (requires external libs), Sxslt (only available on Omegahat and requires extra libs) and the latest phytools (a tar.gz file from Liam’s website), along with the usual mess of a pandoc/latex environment to compile the manuscript itself. By building on ropensci-docker, we need a pretty minimal Dockerfile to compile this environment:
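Sketched from memory, the Dockerfile is roughly along these lines (the package names, repository URL, and tarball name here are illustrative, not the exact file):

```dockerfile
# Start from the rOpenSci base image (R + common packages already installed)
FROM cboettig/ropensci-docker

# System libraries for rJava, Sxslt, and the pandoc/latex toolchain (illustrative list)
RUN apt-get update && apt-get install -y \
    default-jdk \
    libxslt1-dev \
    pandoc \
    texlive

# Sxslt from the Omegahat repository; phytools from a source tarball (hypothetical filename)
RUN Rscript -e 'install.packages("Sxslt", repos = "http://www.omegahat.org/R")' \
 && Rscript -e 'install.packages("phytools.tar.gz", repos = NULL, type = "source")'
```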

You can test drive it (docker image here):

docker run -it cboettig/rnexml /bin/bash

Once in bash, launch R and run rmarkdown::render("manuscript.Rmd"). This will recompile the manuscript from cache and leave you to interactively explore any of the R code shown.

Advantages / Goals

Being able to download a precompiled image means a user can, first, run the code without dependency hell (often not as much an R problem as it is in Python, but nevertheless one that I hit frequently, particularly as my projects age), and second, do so without altering their personal R environment. Third, in principle this makes it easy to run the code on a cloud server, scaling the computing resources appropriately.

I think the real acid test for this is not merely that it recreates the results, but that others can build and extend on the work (with fewer rather than more barriers than usual). I believe most of that has nothing to do with this whole software image thing — providing the methods you use as general-purpose functions in an R package, or publishing the raw (& processed) data to Dryad with good documentation will always make work more modular and easier to re-use than cracking open someone’s virtual machine. But that is really a separate issue.

In this context, we look for an easy way to package up whatever a researcher or group is already doing into something portable and extensible. So, is this really portable and extensible?

Concerns:

  1. This presupposes someone can run docker on their OS — and from the command line at that. Perhaps that’s the biggest barrier to entry right now (though given docker’s virulent popularity, that may be something smart people with big money will soon solve).

  2. The only way to interact with the thing is through a bash shell running on the container. An RStudio server might be much nicer, but I haven’t been able to get that running. Anyone know how to run RStudio server from docker?

(I tried & failed: https://github.com/mingfang/docker-druid/issues/2)

  3. I don’t see how users can move local files on and off the docker container. In some ways this is a great virtue — forcing all code to use fully resolved paths like pulling data from Dryad instead of their hard-drive, and pushing results to a (possibly private) online site to view them. But obviously a barrier to entry. Is there a better way to do this?

Alternative strategies

1) Docker is just one of many ways to do this (particularly if you’re not concerned about maximum performance speed), and quite probably not the easiest. Our friends at Berkeley D-Lab opted for a GUI-driven virtual machine instead, built with Packer and run in Virtualbox, after their experience proved that students were much more comfortable with the mouse-driven installation and a pixel-identical environment to the instructor’s (see their excellent paper on this).

2) Will/should researchers be willing to work and develop in virtual environments? In some cases, the virtual environment can be closely coupled to the native one — you use your own editors etc. to do all the writing, and then execute in the virtual environment (this seems easier in the docker/vagrant approach than in the BCE).

[^1]: friends cc’d above: We’re reviving this ropensci-discuss list to chat about various issues related to our packages, our goals, and broader scientific workflow issues. I’d encourage you to sign up for the listserv: https://groups.google.com/forum/#!forum/ropensci-discuss


Carl Boettiger
UC Santa Cruz
http://carlboettiger.info/

Rich FitzJohn

Aug 7, 2014, 6:11:11 PM8/7/14
to Carl Boettiger, ropensci...@googlegroups.com, Matt Pennell, Yihui Xie, Amy Zanne, Will Cornwell, ti...@idyll.org
Hi Carl,

Thanks for this!

I think that docker is always going to be for the "crazies", at least
in its current form. It requires running on Linux for starters -
I've got it running on a virtual machine on OSX via virtualbox, but
the amount of faffing about there is pretty intimidating. I believe
it's possible to get it running via vagrant (which is in theory going
to be easier to distribute) but at that point it's all getting a bit
silly. It's enlightening to ask a random ecologist to go to the
website for docker (or heroku or vagrant or chef or any of these
newfangled tools) and ask them to guess what they do. We're down a
rabbit hole here.

I've been getting drone (https://github.com/drone/drone ) up and
running here for one of our currently-closed projects. It uses docker
as a way of insulating the build/tests from the rest of the system,
but it's still far from ready to recommend for general use. The
advantages I see there are: our test suite can run for several hours
without worrying about running up against allowed times, and working
for projects that are not yet open source. It also simplifies getting
things off the container, but I think there are a bunch of ways of
doing that easily enough. However, I'm on the lookout for something
much simpler to set up, especially for local use and/or behind NAT. I
can post the dockerfile at some point (it's apparently not on this
computer!) but it's similarly simple to yours.

As I see it, the great advantage of all these types of approaches,
independent of the technology, is the recipe-based approach to
documenting dependencies. With travis, drone, docker, etc, you
document your dependencies and if it works for you it will probably
work for someone else.

I'm OK with this being nerd only for a bit, because (like travis etc)
it's going to be useful enough without having to be generally
accessible. But there will be ideas here that will carry over into
less nerdy activities. One that appeals to me would be to take
advantage of the fancy way that Docker does incremental builds to work
with large data sets that are tedious to download: pull the raw data
as one RUN command, wrangle as another. Then a separate wrangle step
will reuse the intermediate container (I believe). This is sort of a
different way of doing the types of things that Ethan's "eco data
retriever" aims to do. There's some overlap here with make, but in a
way that would let you jump in at a point in the analysis in a fresh
environment.
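For instance, a Dockerfile sketch of that layer-caching idea (the URL and script names are hypothetical):

```dockerfile
FROM cboettig/ropensci-docker

# The slow download runs once and is cached as its own layer
RUN mkdir -p /data && \
    wget -O /data/raw.csv http://example.org/big-dataset.csv

# Editing the wrangling step only rebuilds from here;
# docker reuses the cached data layer above
COPY wrangle.R /data/
RUN Rscript /data/wrangle.R
```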

I don't think that people will jump to using virtual environments for
the sake of it - there has to be some pay off. Isolating the build
from the rest of your machine or digging into a 5 year old project
probably does not have widespread appeal to non-desk types either! I
think that the biggest potential draws are the CI-type tools, but
there are probably other tools that require isolation/virtualisation
that will appeal broadly. Then people will accidentally end up with
reproducible work :)

Cheers,
Rich

Carl Boettiger

Aug 7, 2014, 6:44:02 PM8/7/14
to Rich FitzJohn, ropensci...@googlegroups.com, Matt Pennell, Yihui Xie, Amy Zanne, Will Cornwell, ti...@idyll.org
Thanks Rich! some further thoughts / questions below

On Thu, Aug 7, 2014 at 3:11 PM, Rich FitzJohn <rich.f...@gmail.com> wrote:
Hi Carl,

Thanks for this!

I think that docker is always going to be for the "crazies", at least
in its current form.  It requires running on Linux for starters -
I've got it running on a virtual machine on OSX via virtualbox, but
the amount of faffing about there is pretty intimidating.  I believe
it's possible to get it running via vagrant (which is in theory going
to be easier to distribute) but at that point it's all getting a bit
silly.  It's enlightening to ask a random ecologist to go to the
website for docker (or heroku or vagrant or chef or any of these
newfangled tools) and ask them to guess what they do.  We're down a
rabbit hole here.

Completely agree here.  Anything that cannot be installed by downloading and clicking on something is dead in the water.  It looks like Docker is just download-and-click on Macs or Windows (haven't tested; I have only linux boxes handy).  So I'm not sure that the regular user needs to know that it's running a linux virtual machine under the hood when they aren't on a linux box.  
So I'm optimistic that the installation faffing will largely go away, if it hasn't already.  I'm more worried about the faffing after it is installed.


I've been getting drone (https://github.com/drone/drone ) up and
running here for one of our currently-closed projects.  It uses docker
as a way of insulating the build/tests from the rest of the system,
but it's still far from ready to recommend for general use.  The
advantages I see there are: our test suite can run for several hours
without worrying about running up against allowed times, and working
for projects that are not yet open source.  It also simplifies getting
things off the container, but I think there are a bunch of ways of
doing that easily enough.  However, I'm on the lookout for something
much simpler to set up, especially for local use and/or behind NAT.  I
can post the dockerfile at some point (it's apparently not on this
computer!) but it's similarly simple to yours.

Very cool! Yeah, I think there's great promise that we'll see more easy-to-use tools being built on docker.  Is Drone ubuntu-only at the moment then?


As I see it, the great advantage of all these types of approaches,
independent of the technology, is the recipe-based approach to
documenting dependencies.  With travis, drone, docker, etc, you
document your dependencies and if it works for you it will probably
work for someone else.

Definitely.  I guess this is the heart of the "DevOps" approach (at least according to the BCE paper I linked — they have nice examples that use these tools, but also include case studies of big collaborative science projects that do more-or-less the same thing with Makefiles).

I think the devil is still in the details though.  One thing I like about Docker is the versioned images.  If you re-run my build scripts even 5 days from now, you'll get a different image due to ubuntu repo updates, etc.  But it's easy to pull any of the earlier images and compare.  

Contrast this to other approaches, where you're stuck with locking in particular versions in the build script itself (a la packrat) or just hoping the most recent version is good enough (a la CRAN).  

I'm OK with this being nerd only for a bit, because (like travis etc)
it's going to be useful enough without having to be generally
accessible.  But there will be ideas here that will carry over into
less nerdy activities.  One that appeals to me would be to take
advantage of the fancy way that Docker does incremental builds to work
with large data sets that are tedious to download: pull the raw data
as one RUN command, wrangle as another.  Then a separate wrangle step
will reuse the intermediate container (I believe).  This is sort of a
different way of doing the types of things that Ethan's "eco data
retriever" aims to do.  There's some overlap here with make, but in a
way that would let you jump in at a point in the analysis in a fresh
environment.

Great point, hadn't thought about that.   

I don't think that people will jump to using virtual environments for
the sake of it - there has to be some pay off.  Isolating the build
from the rest of your machine or digging into a 5 year old project
probably does not have widespread appeal to non-desk types either!

Definitely agree with that. I'd like to hear more about your perspective on CI tools though -- of course we love them, but do you think that CI has a larger appeal to the average ecologist than other potential 'benefits'?  I think the tangible payoffs are (cribbing heavily from that Berkeley Common Environment (BCE) paper here):

1) For instructors: having students in a consistent and optimized environment with little effort.  That environment can become a resource maintained and enhanced by a larger community.  

2) For researchers: easier to scale to the cloud (assuming the tool is as easy to use on the desktop as whatever they currently do -- clearly we're not there yet).  

3) Easier to get collaborators / readers to use & re-use.   (I think that only happens if lots of people are performing research and/or teaching using these environments -- just like sharing code written in Go just isn't that useful among ecologists.  Clearly we may never get here.)




Carl Boettiger

Aug 8, 2014, 8:54:26 PM8/8/14
to Rich FitzJohn, ropensci...@googlegroups.com, Matt Pennell, Yihui Xie, Amy Zanne, Will Cornwell, Titus Brown
Hi folks,

Just thought I'd share an update on this thread -- I've gotten RStudio Server working in the ropensci-docker image.  

    docker run -d -p 8787:8787 cboettig/ropensci-docker

will make an RStudio server instance available to you in your browser at localhost:8787.  (Change the first number after -p to serve on a different port.)  You can log in with username/password rstudio/rstudio and have fun.
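For example, to serve it on host port 8080 instead:

```shell
# map host port 8080 to the container's RStudio port 8787
docker run -d -p 8080:8787 cboettig/ropensci-docker
# then browse to localhost:8080 and log in as rstudio / rstudio
```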

One thing I like about this is the ease with which I can now get an RStudio server up and running in the cloud (e.g. I took this for a sail on DigitalOcean.com today).  This means in a few minutes and for a penny you have a URL that you and any collaborators could use to interact with R using the familiar RStudio interface, already provisioned with your data and dependencies in place.

To keep this brief-ish, I've restricted further commentary to my blog notebook (today's post should be up shortly): http://www.carlboettiger.info/lab-notebook.html



Cheers,
Carl

Scott Chamberlain

Aug 12, 2014, 1:36:14 PM8/12/14
to ropensci...@googlegroups.com, Rich FitzJohn, Matt Pennell, Yihui Xie, Amy Zanne, Will Cornwell, Titus Brown
Carl, 

Awesome, nice work. 

Thoughts on whether we could wrap the docker workflow into my Digital Ocean client so that a user never needs to leave R? https://github.com/sckott/analogsea 

Scott



Carl Boettiger

Aug 12, 2014, 1:43:37 PM8/12/14
to ropensci...@googlegroups.com, Rich FitzJohn, Matt Pennell, Yihui Xie, Amy Zanne, Will Cornwell, Titus Brown
Great idea.  Yeah, should be possible.  Does the DO API support a way to launch a job on the instance, or otherwise a way to share a custom machine image publicly? (e.g. the way Amazon EC2 lets you make an AMI public from an S3 bucket?)  

I suspect we can just droplets_new() with the ubuntu_docker image they have, but that we would then need a wrapper to ssh into the DO machine and execute the single command needed to bring up the RStudio instance in the browser.  

Scott Chamberlain

Aug 12, 2014, 2:01:49 PM8/12/14
to ropensci...@googlegroups.com, Rich FitzJohn, Matt Pennell, Yihui Xie, Amy Zanne, Will Cornwell, Titus Brown
Hmm, looks like DO is planning on it, but it's not possible yet. Do go upvote this feature https://digitalocean.uservoice.com/forums/136585-digitalocean/suggestions/3249642-share-an-image-w-another-account

Nice, we could get this working privately, then when sharing is available, boom.

Karthik Ram

Aug 12, 2014, 2:07:10 PM8/12/14
to ropensci...@googlegroups.com, Rich FitzJohn, Matt Pennell, Yihui Xie, Amy Zanne, Will Cornwell, Titus Brown
Yeah, looks like DO doesn't have it yet. I'm happy to leave EC2 to support the little guy. But as with anything, there is a huge diversity of AMIs and greater discoverability on EC2, at least for now.

John Stanton-Geddes

Sep 10, 2014, 9:07:46 AM9/10/14
to ropensci...@googlegroups.com, rich.f...@gmail.com, mwpe...@gmail.com, xiey...@gmail.com, aez...@gmail.com, w.cor...@unsw.edu.au, ti...@idyll.org
Hi Carl and rOpenSci,

Apologies for jumping in late here (and let me know if this should be asked elsewhere or a new topic) but I've also recently discovered and become intrigued by Docker for facilitating reproducible research. 

My question: what's the advantage of Docker over an amazon EC2 machine image? 

I've moved my analyses to EC2 for better performance than my local university cluster. Doesn't my machine image achieve Carl's acid test of allowing others to build and extend on the work? What do I gain by making a Dockerfile on my already-existing EC2 image? Being new to all this, the only clear advantage I see is that a Dockerfile is much smaller than a machine image, but this seems like a rather trivial concern in comparison to the 100s of gigs of sequence data associated with my project.

thanks,
John

Carl Boettiger

Sep 10, 2014, 12:40:46 PM9/10/14
to ropensci...@googlegroups.com, Rich FitzJohn, Matt Pennell, Yihui Xie, Amy Zanne, Will Cornwell, Titus Brown
Hi John,

Nice to hear from you and thanks for joining the discussion. You ask
a very key question that ties into a much more general discussion
about reproducibility and virtual machines. Below I try and summarize
what I think are the promising features of Docker. I don't think this
means it is *the solution*, but I do think it illustrates some very
useful steps forward to important issues in reproducibility and
virtualization. Remember, Docker is still a very young and rapidly
evolving platform.

1) Remix. Titus has an excellent post, "Virtual Machines Considered
Harmful for reproducibility" [1], essentially pointing out that "you
can't install an image for every pipeline you want...". In contrast,
Docker containers are designed to work exactly like that -- reusable
building blocks you can link together with very little overhead in
disk space or computation. This more than anything else sets Docker
apart from the standard VM approach.

2) Provisioning scripts. Docker images are not 'black boxes'. A
"Dockerfile" is a simple make-like script which installs all the
software necessary to re-create ("provision") the image. This has
many advantages: (a) the script is much smaller and more portable than
the image; (b) the script can be version managed; (c) the script gives
a human-readable (instead of binary) description of what software is
installed and how, which avoids the pitfalls of traditional
dependency documentation that may be too vague or out-of-sync; and
(d) other users can build on, modify, or extend the script for their
own needs. All of this is what we call the "DevOps" approach to
provisioning, and can be done with AMIs or other virtual machines
using tools like Ansible, Chef, or Puppet coupled with things like
Packer or Vagrant (or clever use of make and shell scripts).

For a much better overview of this "DevOps" approach in the
reproducible research context and a gentle introduction to these
tools, I highly recommend taking a look at Clark et al [2].

3) You can run the docker container *locally*. I think this is huge.
In my experience, most researchers do their primary development
locally. By running RStudio-server on your laptop, it isn't necessary
for me to spin up an EC2 instance (with all the knowledge & potential
cost that requires). By sharing directories between Docker and the
host OS, a user can still use everything they already know -- their
favorite editor, moving files around with the native OS
finder/browser, using all local configurations, etc, while still
having the code execution occur in the container where the software is
precisely specified and portable. Whenever you need more power, you
can then deploy the image on Amazon, DigitalOcean, a bigger desktop,
your university HPC cluster, your favorite CI platform, or wherever
else you want your code to run. [On Mac & Windows, this uses
something called boot2docker, which was not very seamless early on; it
has gotten much better and continues to improve.]
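Concretely, that directory sharing is a volume mount (the paths here are just an example):

```shell
# mount the current project directory into the container,
# so files edited with local tools are visible to R inside it
docker run -d -p 8787:8787 -v $(pwd):/home/rstudio cboettig/ropensci-docker
```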

4) Versioned images. In addition to version managing the Dockerfile,
the images themselves are versioned using a git-like hash system
(check out: docker commit, docker push/pull, docker history, docker
diff, etc). They have metadata specifying the date, author, parent
image, etc. We can roll back an image through the layers of history
of its construction, then build off an earlier layer. This also
allows docker to do all sorts of clever things, like avoiding
downloading redundant software layers from the docker hub. (If you
pull a bunch of images that all build on ubuntu, you don't get n
copies of ubuntu you have to download and store). Oh yeah, and
hosting your images on Docker hub is free (no need to pay for an S3
bucket... for now?) and supports automated builds based on your
dockerfiles, which acts as a kind of CI for your environment.
Versioning and diffing images is a rather nice reproducibility
feature.
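A sketch of that versioning workflow (image, container, and tag names here are illustrative):

```shell
# inspect the layers an image was built from
docker history cboettig/ropensci-docker

# see what files changed in a running container, then snapshot and share it
docker diff mycontainer
docker commit mycontainer cboettig/ropensci-docker:snapshot
docker push cboettig/ropensci-docker:snapshot
```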

[1]: http://ivory.idyll.org/blog/vms-considered-harmful.html
[2]: https://berkeley.box.com/s/w424gdjot3tgksidyyfl

If you want to try running RStudio server from docker, I have a little
overview in: https://github.com/ropensci/docker


Cheers,

Carl

John Stanton-Geddes

Sep 10, 2014, 1:49:47 PM9/10/14
to ropensci...@googlegroups.com, rich.f...@gmail.com, mwpe...@gmail.com, xiey...@gmail.com, aez...@gmail.com, w.cor...@unsw.edu.au, ti...@idyll.org
Thanks for clarifying, Carl. I'm now fairly convinced that Docker is worth the extra cost (in time, etc) as it provides explicit instructions ('recipe') for the container.

My one residual concern, which is more practical/technological than (open sci) philosophical, is that I still have to be using a system on which I can install Docker to get Docker to work. This is relevant as I can't (easily) install Docker on my 32-bit laptop, since it's only supported on 64-bit. If I go through the (not always necessary) effort of spinning up an AMI, I can access it through anything with ssh. The easy solution is to run Docker on the AMI.

Titus also responded directly to me with the following:

the argument here --
ivory.idyll.org/blog/vms-considered-harmful.html
-- which I have apparently failed to make simply and clearly, judging
by the hostile reactions over the years ;) -- is that it doesn't really matter
*which* approach you choose, so much as whether or not the approach you do
choose permits understanding and remixing.  So I would argue that neither an
AMI nor a fully-baked Docker image is sufficient; what I really want is a
*recipe*.  In that sense the Docker community seems to be doing a better job of
setting cultural expectations than the VM community: for Docker, typically
you provide some sort of install recipe for the whole thing, which is the
recipe I'm looking for.
tl;dr: No technical advantage, but maybe different cultural expectations.

Carl Boettiger

Sep 10, 2014, 3:14:03 PM9/10/14
to ropensci...@googlegroups.com, Amy Zanne, Rich FitzJohn, ti...@idyll.org, mwpe...@gmail.com, xiey...@gmail.com, w.cor...@unsw.edu.au

John,

Thanks again for your input.  Yeah, lack of support for 32-bit hosts is a problem; though since you were speaking about AMIs I imagine you're already used to not working locally, so you can of course try it out on an Amazon image or DigitalOcean droplet.

Yeah, Titus makes a great point.  If we only distributed docker images as 2 GB binary tar files, we'd not be doing much better on the open / remixable side than a binary VM image. And docker isn't the only way to provide this kind of script, as I mentioned earlier.

Nevertheless, I believe there is a technical difference and not just a cultural one.  Docker is not a virtual machine; containers are designed expressly to be remixable blocks.  You can put an R engine in one container and a mysql database in another and connect them. The Docker philosophy aims at one function per container to maximize this reuse.  Of course it's up to you to build this way rather than as a single monolithic Dockerfile, but the idea of linking containers is a technical concept at the heart of docker that offers a second and very different way to address the 'remix' problem of VMs.
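A sketch of what that linking looks like in practice (container names and password are illustrative):

```shell
# one container runs the database...
docker run -d --name db -e MYSQL_ROOT_PASSWORD=secret mysql

# ...and the R container links to it, reaching the
# database under the hostname "db"
docker run -it --link db:db cboettig/ropensci-docker /bin/bash
```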

---
Carl Boettiger
http://carlboettiger.info

sent from mobile device; my apologies for any terseness or typos

Carsten Behring

Sep 27, 2014, 6:22:50 PM9/27/14
to ropensci...@googlegroups.com, aez...@gmail.com, rich.f...@gmail.com, ti...@idyll.org, mwpe...@gmail.com, xiey...@gmail.com, w.cor...@unsw.edu.au
Hi everybody,

some of you mentioned in the previous posts that the hurdle of using docker is still too high for a lot of researchers.

One way to lower the hurdle is to use the cloud.

So instead of asking collaborators to install and use docker on their PCs in order to reproduce something, we do either:

1. We install (with the help of a little automatic tool using cloud service APIs) our docker image in the "cloud" (DigitalOcean / Amazon EC2 / others).
Then we only send the IP address and port to others, and they can start immediately in the RStudio browser environment.
In this case the distributor pays .... for the cloud service.

2. We send a collaborator the name of our image (and the registry) and he uses the same little tool to generate a cloud server containing the RStudio environment.

-> There are costs involved for using the cloud provider.

In both cases the same tool gets as input:

- docker image name (based on RStudio, containing all data / R code of the analysis)
- cloud provider credentials (for billing ...)

and it returns:
- IP address and port of RStudio, ready to use

I did a proof of concept for this with digitalocean and the docker image mgymrek/docker-reproducibility-example.


With a simple call to the DigitalOcean API, like this:

(create-droplet "my token" {:name "core1" :region "ams3" :size "512mb" :image 6373176 :ssh_keys [42550] :user_data user-data})

where the "user-data" contains some info for coreos operating system to start a certain docker image on boot:

#cloud-config

coreos:
    units:
      - name: docker-rs.service
        command: start
        content: |
          [Unit]
          Description=RStudio service container
          Author=Me
          After=docker.service

          [Service]
          Restart=always
          ExecStart=/usr/bin/docker run -p 49000:8787  --name "rstudio" mgymrek/docker-reproducibility-example
          ExecStop=/usr/bin/docker stop rstudio


and, voila, on boot it starts RStudio on a new digitalocean server, ready to use.

It should work the same way for Amazon EC2 or others.
So the tool could allow selecting the cloud provider.


I am also pretty sure that a similar tool could be built which does the same for an installation on a local PC (Windows, Linux, OS X).

I will start some development / info here: https://github.com/behrica/ropensciCloud

The big added value of docker compared to classical virtual machines is that it solves the distribution problem of the images.
By just specifying "image name", "tag" and "registry" (if Docker Hub is not used), each docker client knows how to get the image.


By using common base images, it would even be very fast to download the images (after the first download has happened).

Maybe ropensci could host its own image registry somewhere...



Carsten




Carl Boettiger

Sep 28, 2014, 5:49:17 PM9/28/14
to ropensci...@googlegroups.com, Amy Zanne, Rich FitzJohn, Titus Brown, Matt Pennell, Yihui Xie, Will Cornwell
Carsten,

Thanks for joining the discussion and sharing your experiences, it's
really nice to see how others are using these approaches.

I agree entirely that cloud platforms like digitalocean give a really
nice user experience coupled with docker and RStudio-server.
Certainly ropensci could host its own hub, but the Docker Hub works
rather nicely already so I'm not sure what the advantage of that might
be? I also agree that docker has many advantages for reproducible
research, though it is worth noting that other 'classical' virtual
machines can offer a very similar solution to the distribution problem
you describe -- e.g. Vagrant Hub.

Nonetheless, I still see running docker locally as an important part
of the equation. In my experience, most researchers still do some
(most/all) of their development locally. If we are to have a
containerized, portable, reproducible development environment, that
means running docker locally (as well as in the cloud).

The reproducible research angle has the most to gain when people can
both build on existing Dockerfiles / docker images, as you mention,
but also when they can write their own Dockerfiles particular to their
packages. I don't think that's too high a barrier -- it's easier than
writing a successful .travis.yml or other CI file for sure -- but it
does mean being able to do more than just access an RStudio server
instance that just happens to be running in a docker container.

On a linux machine, this kind of containerized workflow is remarkably
seamless. I think boot2docker has come a long way in making this
easier, but not there yet.

To that end, we're still doing lots of work on defining some useful
base Dockerfiles and images that others could build from, including
the rstudio example. Current development is at
https://github.com/eddelbuettel/rocker

Thanks for sharing the API call example, that's awesome! (btw, Scott
has an R package in the works for the digitalocean API that may also
be of interest: https://github.com/sckott/analogsea).

Carsten Behring

Sep 29, 2014, 4:12:09 AM9/29/14
to ropensci...@googlegroups.com, aez...@gmail.com, rich.f...@gmail.com, ti...@idyll.org, mwpe...@gmail.com, xiey...@gmail.com, w.cor...@unsw.edu.au
Dear Carl,

thanks for your support and comments.

I would like to reply to some of you points.

In a lot of organisations, getting Linux or even docker on Windows is impossible to achieve. Security concerns mean a lot of admins won't touch the corporate PCs to install "strange" applications.
This will get worse in the coming years, in my view.
That is why I would like to envision a reproducible research workflow which is completely independent of any locally installed software,
and only needs a web browser.

I believe that in certain organisations it is easier to get a credit card and a budget to pay for cloud computing services than to get "special" virtualisation software like VMware/Docker onto the user's standard corporate Windows PC.
This means we need to come to solutions which include "RStudio in the cloud" as one possible computing engine.
The same argument holds for getting access to fast hardware with a lot of memory.

Regarding the usage of docker hub:

Docker Hub is surely the best place for the kind of base images and their Dockerfiles that you are working on.

I was thinking that we could envision each individual study / analysis being published as a Docker image.
In that case, I would say Docker Hub is not the right place to store all of them.
It would be nice to have a dedicated registry just for sharing "docker images of statistical analyses", which could offer different features for searching and so on.

So my ideal scenario would be this:

1. There would be one (or even several) Docker registries dedicated to "Docker images containing RStudio-based images with individual analysis projects".
2. Having the images there means that a user with a local Docker installation can use them as usual with his local Docker client.
3. A user without a Docker installation can "press a button" and automatically get a new cloud server (DigitalOcean, Amazon EC2, others) containing the RStudio-based image,
after supplying his authentication data (so he pays for it). He can then log in immediately and look at (and change) the analysis directly in the cloud.

What is still missing is that in case 3 the user cannot easily republish his changes as a new Docker image. But this is solvable: it would need an R package which can interact with the running cloud server (over ssh ...) and re-create and re-publish a new version of the image at the user's request.

So in this scenario, setting up a "Docker Hub based" registry for "RStudio-based reproducible research images" somewhere would be the starting point.
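That missing republish step could be sketched with plain Docker commands run against the cloud server. The container and image names below are hypothetical placeholders, and the script is a dry run that only prints the commands it would execute:

```shell
# Dry-run sketch: snapshot a modified RStudio container and republish it.
# CONTAINER and TAG are hypothetical placeholders.
CONTAINER=rstudio_analysis
TAG=myuser/my-analysis:v2

run() { echo "+ $*"; }   # dry run: print commands instead of executing them

run docker commit "$CONTAINER" "$TAG"   # snapshot the running container as an image
run docker push "$TAG"                  # publish it to a registry
```

An R package could assemble and run exactly these commands over ssh on the user's behalf.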

 Please provide me with any comments you might have.

Carsten Behring

unread,
Sep 29, 2014, 11:29:33 AM9/29/14
to ropensci...@googlegroups.com, aez...@gmail.com, rich.f...@gmail.com, ti...@idyll.org, mwpe...@gmail.com, xiey...@gmail.com, w.cor...@unsw.edu.au
Dear all,

I deployed the very first version of my docker-to-the-cloud application.
It basically takes the name of an existing Docker image and starts a new DigitalOcean server, on which the image gets automatically installed and started.

If you use an image based on this Dockerfile:



It should make RStudio available on port 49000 (mapped from 8787) of the newly created droplet.

For the billing to work, you need to provide your DigitalOcean token.

A valid ssh_id is needed as well. In theory we do not need ssh (nothing gets done via ssh), but the CoreOS DigitalOcean image I use requires an ssh_id to be specified.


Example parameters could be:

Token: basidbasiucbauiocbaobca   (long string)
image name: cboettig/ropensci
ssh_id: 12345    (is only visible via api ... to be replaced by "key name" soon)

Then after some minutes RStudio should be available at http:// IP:49000
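For those curious what such an app does under the hood: creating the droplet is roughly a single call to the DigitalOcean v2 API, passing a cloud-config user_data script that starts the image. The sketch below is hypothetical (region, size, image slug, and all values are placeholders, not taken from Carsten's app) and only prints the request instead of sending it:

```shell
# Dry-run sketch of a DigitalOcean v2 droplet-create call (placeholders
# throughout; a real app would pass the user's token, image name, and ssh_id).
TOKEN="basidbasiucbauiocbaobca"
IMAGE="cboettig/ropensci"
SSH_ID=12345

# cloud-config that runs the image, mapping RStudio's 8787 to 49000
USER_DATA="#cloud-config runcmd: ['docker run -d -p 49000:8787 $IMAGE']"

BODY="{\"name\":\"rstudio\",\"region\":\"nyc3\",\"size\":\"1gb\",\"image\":\"coreos-stable\",\"ssh_keys\":[$SSH_ID],\"user_data\":\"$USER_DATA\"}"

run() { echo "+ $*"; }   # dry run: print instead of executing

run curl -X POST "https://api.digitalocean.com/v2/droplets" \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" -d "$BODY"
```

Once the droplet boots and pulls the image, RStudio answers on port 49000 as described above.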

The code of the app is available here:


Please try it out and provide me with any comments you might have.


Carl Boettiger

unread,
Sep 29, 2014, 12:22:58 PM9/29/14
to ropensci...@googlegroups.com, Amy Zanne, Rich FitzJohn, Titus Brown, Matt Pennell, Yihui Xie, Will Cornwell
Carsten,

Thanks for your reply, you bring up really great points about the
realities of PC environments that I'm not really in touch with. It's
easy for me to get stuck just thinking about the small academic lab
context so it's great that you can help us think more big picture
here.

Also good points about an archival repository. For those following
along, one of the nice things about Docker is that the software that
runs the Docker Hub, docker-registry, is open source, so anyone can
host their own public or private hub. Easy sharing is a key feature
that I think has helped make Docker successful and a compelling
element for reproducible research.

While I see your point that the Docker Hub might not be ideal for all
cases, I think the most important attribute of a repository should be
longevity. Certainly Docker Hub won't be around forever, but at this
stage, with $60 million in its latest VC round, it's likely to be more
long-lasting than anything that a small organization like rOpenSci
would host. It would be great to see existing scientific repositories
show an interest in archiving images in this way though, since
organizations like DataONE and Dryad already have recognition in the
scientific community and better discoverability / search / metadata
features. Building on the docker-registry technology would of course
make a lot more sense than just archiving static binary docker images,
which lack both the space-saving features and the ease of download /
integration that examples like the docker registry have.
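To make the static-image comparison concrete: archiving outside a registry means shipping flat tarballs with `docker save` / `docker load`, which forgoes the layer sharing and incremental pulls a registry gives you. The image name below is a placeholder, and the script is a dry run that only prints the commands:

```shell
# Dry-run sketch: archiving a static binary image vs. pulling from a registry.
IMAGE=cboettig/ropensci

run() { echo "+ $*"; }   # dry run: print commands instead of executing them

run docker save -o analysis-image.tar "$IMAGE"   # flat tarball for an archive
run docker load -i analysis-image.tar            # restore it later
run docker pull "$IMAGE"                         # registry route: layered, incremental
```

A scientific repository that only stored the tarball would still be reproducible, just far less convenient than a registry-backed one.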

I think it would be interesting to bring that discussion to some of
the scientific data repositories and see what they say. Of course
there's the chicken & egg problem in that most researchers have never
heard of docker. Would be curious what others think of this.

Cheers,

Carl

p.s. Nice proof of principle with the herokuapp, by the way, but I'm
missing something about the logic here. If I just go to
https://blooming-lake-3277.herokuapp.com with the parameters you
provide, I'm told it can't authenticate. Am I supposed to be running
the app on my own droplet instead? Sorry, a bit over my head.

Carsten Behring

unread,
Sep 30, 2014, 3:36:56 AM9/30/14
to ropensci...@googlegroups.com, aez...@gmail.com, rich.f...@gmail.com, ti...@idyll.org, mwpe...@gmail.com, xiey...@gmail.com, w.cor...@unsw.edu.au
Carl,

regarding the authentication for the app, I will check.

It's the first time I have used Heroku myself.
I thought that by default an application on Heroku is public...
(It works for me, probably because I am logged into Heroku.)

Can you send me a screenshot of the error message?

Carsten



Carsten Behring

unread,
Oct 2, 2014, 3:48:26 PM10/2/14
to ropensci...@googlegroups.com, aez...@gmail.com, rich.f...@gmail.com, ti...@idyll.org, mwpe...@gmail.com, xiey...@gmail.com, w.cor...@unsw.edu.au
Ciao Carl,

the application is public, so maybe you just entered the wrong data.

You need to enter your DigitalOcean API token and the name of one of your ssh keys from DigitalOcean.

Carsten

Carsten Behring

unread,
Oct 6, 2014, 5:13:33 PM10/6/14
to ropensci...@googlegroups.com, aez...@gmail.com, rich.f...@gmail.com, ti...@idyll.org, mwpe...@gmail.com, xiey...@gmail.com, w.cor...@unsw.edu.au
Dear all,

my use case of "simple use of docker, cloud + RStudio" is largely satisfied by the digitalocean R package under development here:


It is DigitalOcean-specific, but that's good enough for a start.

A simple cloud-hosted (Shiny) web application using this library might ease further usage even more.

Regards