Re: HashDist


Aron Ahmadia

Feb 13, 2015, 4:16:56 AM
to Colin Jermain, hash...@googlegroups.com
On Tue, Feb 10, 2015 at 11:31 PM, Colin Jermain <cjer...@gmail.com> wrote:
Hey Aron,

I've been looking into HashDist since your mention of it in IPython Issue #7715. From the available documentation, it's still not clear exactly what HashDist provides. The tutorial page is missing. I did find your SciPy 2014 lecture, which gave me some idea of the concept, but I'm still not certain how close to fruition the project is, what the overall goals are, or what needs to be done to get it there.

I would be very interested if HashDist is providing a self-contained virtual-environment that extends beyond Python. I am often trying to share my Python/Cython code with others who have non-Linux systems and I am frustrated by the difficulty they encounter when trying to acquire the dependencies. I would also like to explicitly define the dependencies of an IPython Notebook, to allow others to correctly reconstruct the environment that I am using. The pip installer combined with virtualenv is a nice solution, except for the gfortran/gcc requirements (including headers for many libraries that on Linux require additional dev packages). So far, I've suggested people use Enthought or Anaconda distributions, but I dislike the access control that is given to users in these environments and their restriction to certain versions. HashDist appears to have nice features in pointing directly to commits on open-source packages and creating a cross-language record of the dependencies.

Is the virtual environment the correct analogy to HashDist?
Does HashDist package all dependencies into a single directory for each "virtual environment"?
Are dependencies shared between these, or are they re-included in each environment? i.e. Is there reuse of dependencies?
Is there a tutorial/example of using HashDist?

Hi Colin,

Thanks for the email and your interest.  I really appreciate it.  I'm cc-ing the main developer list in the hope that I'll get some help answering your questions.  You can find us in our gitter chatroom: https://gitter.im/hashdist/public, on this mailing list, and on the hashdist/hashstack issue tracker.

I'm going to copy/paste a few things from my private reply here, and then add a few more notes at the bottom:

As I mentioned, we're in a pretty stable beta at this point.  The project is used by the Proteus team at the US Army Corps of Engineers, a group in Los Alamos National Labs, and the FEniCS developers at University of Oslo, as well as a handful of other people.  The Sage project is currently evaluating a new build system using our tool, but I haven't heard any news on that lately.

One of our promises is to provide self-contained build environments that extend beyond Python.  The project was actually born out of desires to help install "non-trivial" dependencies across both personal computers and high performance clusters.  As a result, we try to support common operating systems and to avoid requiring administrative access.

We've currently built up a list of over a hundred packages that we can install on Linux, OS X, and Cygwin.  As you can guess, there's an important "substrate" of libraries such as LAPACK as well as other dependencies that need to be satisfied in different ways on different operating systems.  Our current approach allows us to provide source-driven builds to developers and users.  If you can use the same hashdist build store path on another computer (say, /opt/hashdist/cache), then you can simply copy over builds from one computer to the next, and there's a good chance they'll work!

HashDist has the best tools for reproducibility/provenance that I'm aware of.

We tend to think of software profiles instead of virtual environments, although these things all tend to be somewhat related/muddled.

We isolate each package build (into the HashDist build store), then give you a set of links to a guaranteed "correct" environment, reducing wasted disk space/builds when you have a package that can be used in multiple environments.
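As a rough illustration of that idea (not HashDist's actual on-disk layout; the directory names, the `numpy-tool` script, and the `abc123` hash are all invented for this sketch), two profiles can link to a single isolated build instead of each carrying its own copy:

```shell
# Illustration only: names and the hash below are made up, not HashDist's real layout.
demo=$(mktemp -d)
store="$demo/store"

# One isolated build per package, keyed by a (hypothetical) hash.
mkdir -p "$store/numpy-abc123/bin"
printf '#!/bin/sh\necho numpy-tool\n' > "$store/numpy-abc123/bin/numpy-tool"
chmod +x "$store/numpy-abc123/bin/numpy-tool"

# Two profiles link to the same build instead of rebuilding or copying it.
for profile in profile_a profile_b; do
    mkdir -p "$demo/$profile/bin"
    ln -s "$store/numpy-abc123/bin/numpy-tool" "$demo/$profile/bin/numpy-tool"
done

# Both profiles run the very same file from the store.
out_a=$("$demo/profile_a/bin/numpy-tool")
out_b=$("$demo/profile_b/bin/numpy-tool")
echo "$out_a $out_b"
```

The build exists once on disk; each profile is only a directory of links into the store.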

With regards to a tutorial on how to get started, I think this was your request:

It would be great to create a walk-through tutorial to show how to use HashDist to install: numpy, scipy, pandas, matplotlib, and IPython (& notebook) -- and then describe using it within the notebook. I will look in more detail at the IPython plug-in that you published on GitHub. Perhaps the tutorial goal could be to plot a Gaussian and its integral with matplotlib in the IPython Notebook on a system that does not have any of the dependencies installed.

I haven't written anything down of this nature, but I agree that it would be a pretty high priority to have and that is a nice example, since we already have a functioning SciPy stack that works well on Linux, OS X, and Cygwin systems.  Our current requirement on all three operating systems is that the user has gfortran installed if they need SciPy, although Ondrej has some interesting examples where he bootstraps a compiler through a hashstack profile, and if we get binary installs working, this becomes feasible to do swiftly (on the order of a minute or so to download and make available).

At minimum, HashDist needs a working compiler, Git, and Python to build things.  Right now, we also require that gfortran already be installed, but that can go away with some improvement to our support for relocatable binaries.  There are grander challenges in supporting scientific Python stacks "natively" on Windows without the use of Cygwin.

Would the tutorial still be of value to you?

Regards,
Aron

Aron Ahmadia

Feb 13, 2015, 11:19:53 AM
to Colin Jermain, hash...@googlegroups.com

On Fri, Feb 13, 2015 at 11:10 AM, Colin Jermain <cjer...@gmail.com> wrote:

I played around with HashDist briefly and made a build. Coming from virtualenv, I'm expecting a "source venv/bin/activate" type command to set the environment variables. It's still not clear to me how this is done in HashDist after a cursory read through the User's guide, or whether there is a different method. What am I missing?


Hi Colin,

Thanks for the feedback.  I'm not sure if we've got enough mass appeal for web developers, since there are so many other products in that space that are more targeted towards them.  That said, I agree that the home page needs more work.

When you type `source venv/bin/activate`, a simple script corresponding to your shell sets your PATH environment variable to put `/path/to/venv/bin` ahead on your path.  Ondrej has proposed a similar set of scripts for HashDist, but they are not implemented yet.  I usually set up a shell alias to do this for me.  It looks something like:

# on zsh
path=(~/hashstack/project/bin $path)

# on bash
export PATH=~/hashstack/project/bin:$PATH
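
The same idea can be wrapped in a small shell function. This is only a sketch: `hashdist_activate` is a hypothetical name, not a command HashDist provides, and it assumes the profile directory contains a `bin` subdirectory, as in the lines above.

```shell
# Hypothetical helper, not part of HashDist.
# Prepends a profile's bin directory to PATH, skipping it if already present.
hashdist_activate() {
    profile_dir=$1
    if [ ! -d "$profile_dir/bin" ]; then
        echo "no bin directory under $profile_dir" >&2
        return 1
    fi
    case ":$PATH:" in
        *":$profile_dir/bin:"*) ;;   # already on PATH, nothing to do
        *) PATH="$profile_dir/bin:$PATH"; export PATH ;;
    esac
}
```

With this in a shell startup file, `hashdist_activate ~/hashstack/project` would have the same effect as the export line above, and calling it twice would not add the directory a second time.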

Maybe a "HashDist for virtualenv users" section of the user guide or FAQ page would be useful.

A

Jimmy Tang

Feb 16, 2015, 4:04:36 AM
to hash...@googlegroups.com, cjer...@gmail.com, ar...@ahmadia.net
Hi All,

When I started to play with hashdist/hashstack for my own stuff, I had the same problems with the docs that Colin is having now. I think a short how-to that walks through 2 or 3 scenarios, each with a different type of user, would be good.

Maybe something like

Scenario 1 - Alice - the lone researcher who needs an updated version of petsc on her laptop
Scenario 2 - Bob - the student of Alice who wants to take Alice's stack to run on an HPC cluster
Scenario 3 - Charles - the sys-admin trying to re-create a stack from Alice for debugging system problems of a library

Jimmy 

Colin Jermain

Feb 16, 2015, 9:48:22 AM
to Jimmy Tang, hash...@googlegroups.com, ar...@ahmadia.net
Agreed. These 3 scenarios would be greatly beneficial.

On 02/16/2015 04:04 AM, Jimmy Tang wrote:
> Scenario 1 - Alice - the lone researcher who needs an updated version
> of petsc on her laptop
> Scenario 2 - Bob - the student of Alice who wants to take Alice's
> stack to run on an HPC cluster
> Scenario 3 - Charles - the sys-admin trying to re-create a stack from
> Alice for debugging system problems of a library

I really like Ondrej's proposal for profile loading, and Dag Sverre's
syntax for use (hit env profilename). Since I'm looking for a way to
distribute code to people who do not have significant shell experience,
I would prefer to not require them to set environment variables. Instead
the hit env command would be a great and simple alternative. What design
decisions need to be made before this can be implemented?

Regards,

Colin