Hi rOpenSci list + friends [^1],
Yay, the ropensci-discuss list is revived!
Some of you might recall a discussion about reproducible research in the comments of Rich et al’s recent post on the rOpenSci blog, where quite a few people mentioned the potential for Docker as a way to facilitate this.
I’ve only just started playing around with Docker, and though I’m quite impressed, I’m still rather skeptical that non-crazies would ever use it productively. Nevertheless, I’ve worked up some Dockerfiles to explore how one might use this approach to transparently document and manage a computational environment, and I was hoping to get some feedback from all of you.
For those of you who are already much more familiar with Docker than me (or are looking for an excuse to explore!), I’d love to get your feedback on some of the particulars. For everyone, I’d be curious what you think about the general concept.
So far I’ve created a Dockerfile and image.
If you have docker up and running, perhaps you can give it a test drive:
docker run -it cboettig/ropensci-docker /bin/bash
You should find R installed with some common packages. This image builds on Dirk Eddelbuettel’s R docker images and serves as a starting point to test individual R packages or projects.
For instance, my RNeXML manuscript draft is a bit more of a bear than usual to run, since it needs rJava (which requires external libs), Sxslt (only available on Omegahat, and also requiring extra libs), and the latest phytools (a tar.gz file from Liam’s website), along with the usual mess of a pandoc/latex environment to compile the manuscript itself. By building on ropensci-docker, we need only a pretty minimal Dockerfile to create this environment:
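Roughly, the Dockerfile is a sketch like this (the system libraries, repository URLs, and package paths below are illustrative rather than the exact ones in the real file, but the shape is right):

```dockerfile
# Start from the rOpenSci base image, which already provides R,
# common packages, and the pandoc/latex toolchain
FROM cboettig/ropensci-docker

# External system libraries needed by rJava and the XSLT bindings
# (package names here are illustrative)
RUN apt-get update && apt-get install -y default-jdk libxslt1-dev \
  && R CMD javareconf

# R dependencies: Sxslt from the Omegahat repository, the latest
# phytools as a source tar.gz, the rest from CRAN (URLs illustrative)
RUN Rscript -e 'install.packages(c("rJava", "rmarkdown"))' \
  && Rscript -e 'install.packages("Sxslt", repos = "http://www.omegahat.org/R", type = "source")' \
  && Rscript -e 'install.packages("http://www.phytools.org/phytools.tar.gz", repos = NULL, type = "source")'
```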
You can test drive it (docker image here):
docker run -it cboettig/rnexml /bin/bash
Once in bash, launch R and run rmarkdown::render("manuscript.Rmd"). This will recompile the manuscript from cache and leave you to interactively explore any of the R code shown.
Being able to download a precompiled image means a user can run the code, first, without dependency hell (often not as much an R problem as it is in Python, but nevertheless one that I hit frequently, particularly as my projects age), and second, without altering their personal R environment. Third, in principle this makes it easy to run the code on a cloud server, scaling the computing resources appropriately.
I think the real acid test for this is not merely that it recreates the results, but that others can build and extend on the work (with fewer rather than more barriers than usual). I believe most of that has nothing to do with this whole software image thing — providing the methods you use as general-purpose functions in an R package, or publishing the raw (& processed) data to Dryad with good documentation will always make work more modular and easier to re-use than cracking open someone’s virtual machine. But that is really a separate issue.
In this context, we look for an easy way to package up whatever a researcher or group is already doing into something portable and extensible. So, is this really portable and extensible?
This presupposes someone can run docker on their OS, and from the command line at that. Perhaps that’s the biggest barrier to entry right now (though given docker’s virulent popularity, it may be something that smart people with big money will soon solve).
The only way to interact with this thing is through a bash shell running on the container. An RStudio server would be much nicer, but I haven’t been able to get that running. Does anyone know how to run RStudio server from docker?
(I tried & failed: https://github.com/mingfang/docker-druid/issues/2)
1) Docker is just one of many ways to do this (particularly if you’re not concerned about maximum performance), and quite probably not the easiest. Our friends at the Berkeley D-Lab opted for a GUI-driven virtual machine instead, built with Packer and run in VirtualBox, after their experience showed that students were much more comfortable with a mouse-driven installation and a pixel-identical environment to the instructor’s (see their excellent paper on this).
2) Will/should researchers be willing to work and develop in virtual environments? In some cases the virtual environment can be closely coupled to the native one: you use your own editors etc. to do all the writing, and then execute in the virtual environment (this seems easier in the docker/vagrant approach than in the BCE).
[^1]: friends cc’d above: We’re reviving this ropensci-discuss list to chat about various issues related to our packages, our goals, and broader scientific workflow issues. I’d encourage you to sign up for the listserv: https://groups.google.com/forum/#!forum/ropensci-discuss
—
Carl Boettiger
UC Santa Cruz
http://carlboettiger.info/
Hi Carl,
Thanks for this!
I think that docker is always going to be for the "crazies", at least
in its current form. It requires running on Linux for starters -
I've got it running on a virtual machine on OSX via virtualbox, but
the amount of faffing about there is pretty intimidating. I believe
it's possible to get it running via vagrant (which is in theory going
to be easier to distribute) but at that point it's all getting a bit
silly. It's enlightening to ask a random ecologist to go to the
website for docker (or heroku or vagrant or chef or any of these
newfangled tools) and ask them to guess what they do. We're down a
rabbit hole here.
I've been getting drone (https://github.com/drone/drone ) up and
running here for one of our currently-closed projects. It uses docker
as a way of insulating the build/tests from the rest of the system,
but it's still far from ready to recommend for general use. The
advantages I see there are: our test suite can run for several hours
without worrying about running up against allowed times, and working
for projects that are not yet open source. It also simplifies getting
things off the container, but I think there are a bunch of ways of
doing that easily enough. However, I'm on the lookout for something
much simpler to set up, especially for local use and/or behind NAT. I
can post the dockerfile at some point (it's apparently not on this
computer!) but it's similarly simple to yours.
As I see it, the great advantage of all these types of approaches,
independent of the technology, is the recipe-based approach to
documenting dependencies. With travis, drone, docker, etc, you
document your dependencies and if it works for you it will probably
work for someone else.
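As a hypothetical sketch of what I mean (not from one of our projects), the config file for such a service is itself the dependency recipe:

```yaml
# hypothetical .travis.yml: the build config doubles as a readable,
# executable record of what the project needs to run
language: r
apt_packages:
  - libxml2-dev
r_packages:
  - testthat
```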
I'm OK with this being nerd-only for a bit, because (like travis etc)
it's going to be useful enough without having to be generally
accessible. But there will be ideas here that will carry over into
less nerdy activities. One that appeals to me would be to take
advantage of the fancy way that Docker does incremental builds to work
with large data sets that are tedious to download: pull the raw data
as one RUN command, wrangle as another. Then changing the wrangle step
will reuse the cached intermediate container from the download (I
believe). This is sort of a
different way of doing the types of things that Ethan's "eco data
retriever" aims to do. There's some overlap here with make, but in a
way that would let you jump in at a point in the analysis in a fresh
environment.
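Something like this is what I have in mind (the URLs and commands are entirely hypothetical; the point is the layering):

```dockerfile
# Each RUN produces a cached intermediate image, so a rebuild after
# editing only the wrangle step reuses the slow download layer rather
# than fetching the data again.
FROM cboettig/ropensci-docker

# slow step: runs once, then comes from cache on later builds
RUN wget -O /data/raw.csv http://example.org/some-large-dataset.csv

# fast step to iterate on: only this layer (and any after it) rebuilds
RUN Rscript -e 'd <- read.csv("/data/raw.csv"); saveRDS(d, "/data/clean.rds")'
```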
I don't think that people will jump to using virtual environments for
the sake of it - there has to be some pay off. Isolating the build
from the rest of your machine or digging into a 5 year old project
probably does not have widespread appeal to non-desk types either!
--
the argument here --
ivory.idyll.org/blog/vms-considered-harmful.html
-- which I have apparently failed to make simply and clearly, judging
by the hostile reactions over the years ;) -- is that it doesn't really matter
*which* approach you choose, so much as whether or not the approach you do
choose permits understanding and remixing. So I would argue that neither an
AMI nor a fully-baked Docker image is sufficient; what I really want is a
*recipe*. In that sense the Docker community seems to be doing a better job of
setting cultural expectations than the VM community: for Docker, typically
you provide some sort of install recipe for the whole thing, which is the
recipe I'm looking for.
tl;dr: No technical advantage, but maybe different cultural expectations.
John,
Thanks again for your input. Yeah, lack of support for 32-bit hosts is a problem; though since you were speaking about AMIs, I imagine you were already used to not working locally, so you can of course try it out on an Amazon image or a DigitalOcean droplet.
Yeah, Titus makes a great point. If we only distributed docker images as 2 GB binary tar files, we'd not be doing much better on the open/remixable side than a binary VM image. And docker isn't the only way to provide this kind of script, as I mentioned earlier.
Nevertheless, I believe there is a technical difference and not just a cultural one. Docker is not a virtual machine; containers are designed expressly to be remixable blocks. You can put an R engine on one container and a mysql database on another and connect them. Docker philosophy aims at one function per container to maximize this reuse. Of course it's up to you to build this way rather than as a single monolithic Dockerfile, but the idea of linking containers is a technical concept at the heart of docker that offers a second, very different way to address the 'remix' problem of VMs.
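For example (the image names and link alias here are just illustrative):

```shell
# run a database in its own container, detached, named "db"
docker run -d --name db mysql

# link it into an interactive R container: inside, the database is
# reachable as the host "db" (docker injects the address via
# environment variables and /etc/hosts)
docker run -it --link db:db cboettig/ropensci-docker /bin/bash
```

Each piece can then be swapped or upgraded independently, which is exactly the kind of remixing a monolithic VM image makes hard.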
---
Carl Boettiger
http://carlboettiger.info
sent from mobile device; my apologies for any terseness or typos