hashdist/conda integration


Chris Kees

Feb 16, 2017, 3:16:11 PM
to hash...@googlegroups.com
Group,

It may be possible to fund some work on hashdist/hashstack and conda/conda-forge in the near future, so I'd like to start a discussion on tasks and priorities. Anybody interested in a round table discussion on skype or g+ in the next two weeks?  I'll be at SIAM CS&E in case anyone else will be there. Here's a list of proposed improvements:

- better support for host/vendor dependencies in the HPC environment, specifically integration with the module system in the HPC realm and better interaction with host dependencies and compilers
- better support for binary package management on standard platforms and HPC platforms, with flexible, secure package mirrors/channels
- an improved CLI that allows most users to manage their stack in a more traditional "stateful" way (allowing `pip install` or `conda install` and switching profiles/environments more easily)
- better integration with docker builds, with particular attention to cache-busting and taking advantage of the precise dependency management of hashdist
- incremental improvements in debugging package builds and communicating yaml parser errors
- better windows and mac support

My perspective
--------------------

What drove the development of hashdist was the need for customized, reproducible builds of scientific Python stacks, with many non-Python dependencies and specifically vendor/host dependencies. In particular, supporting those builds in a platform-independent way across machines in a _slightly_ challenging HPC environment. That need still exists and hasn't been addressed well by other tools that I know of.

I'm currently dependent on my branch of hashdist (https://github.com/hashdist/hashdist/pull/314) to build python 2.7.x plus a bunch of other C/C++/Fortran/Python packages spanning parallel numerics, i/o, mesh generation, and geospatial analysis in support of the proteus toolkit for computational modeling and simulation (proteustoolkit.org), which is a Python package. The stack currently builds on most of the DoD HPC systems and a handful of academic and European clusters. We also use it on linux and mac development machines and in building docker images used within jupyterhub, binder, and travis. The hashdist branch I use has simple support for relocating binary packages on posix systems, and I rely on the binary installs in the docker, HPC, and travis contexts, where you don't really want to be doing source builds of the entire dependency tree most of the time. The binaries are simple tarballs with some location info, pushed to remote caches with the `hit remote` command and built automatically using the buildbot CI tool.  It's a far cry from user-friendly binary package management support, but it works.

On the other hand, conda is an open-source binary package management tool that is platform independent and supports a nice ecosystem of recipes and binary packages for linux, mac, and windows via conda-forge.  It provides a CLI for managing a user's Python environment in the stateful way that most users expect.

I have some ideas on how we could proceed with integrating hashdist and conda, but I thought I'd survey the list about interest in collaborating first.

Thanks,
Chris


Ondřej Čertík

Feb 28, 2017, 1:13:56 PM
to hash...@googlegroups.com
Hi Chris,
I already commented privately, but I'll post publicly as well: Yes, I
am excited about this development and would like to see it succeed.

In terms of priorities, I would like to first see hashdist generate
Conda recipes that Conda can build. Then go from there by improving
the process and user experience.

It is my understanding that there is no real difference in how
Hashdist and Conda handle environments; they both link them in a
similar way. Regarding "statefulness", I think both Hashdist and Conda
allow it in the same way --- the development profile in Hashdist
allows installing custom things into it, and Conda handles it in a
similar way. Neither allows modification of the installed locations of
the individual packages, as far as I know. So there is no real
difference.

Ondrej

Volker Braun

Feb 28, 2017, 4:22:41 PM
to hashdist
On Tuesday, February 28, 2017 at 7:13:56 PM UTC+1, ondrej.certik wrote: 
[...] the development profile in Hashdist
allows to install custom things into it, Conda handles it in similar
way. Neither allows modification of the installed locations of the
individual packages, as far as I know.

I'd say the main difference is that hashdist uses symlinks, so programs essentially run from ~/.hashdist, while conda copies binaries (and patches paths in the binaries). This makes conda profiles simpler and more flexible: e.g. conda has binary packages and you can modify installed files as you want. I'm aware that this is not entirely clear-cut; hashdist has the development profile and conda can also use links. Still, hashdist doesn't work on (at least some) ARM boxes because patchelf segfaults (a fragile alternative to just patching binaries), and e.g. "./default/bin/pip install --upgrade pip" dies with an "OSError: Cannot call rmtree on a symbolic link" even for a development profile.

The simplest "conda integration" would be just to pad the build paths in ~/.hashdist and then tar.bz2 up the artifacts with some metadata files (https://conda.io/docs/spec.html). Just have hashdist spit out a bunch of conda packages!
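Volker's suggestion is straightforward to prototype: a conda package is just a tar.bz2 containing the payload files plus an `info/` directory of metadata (see the conda spec he links). A minimal sketch, with a hypothetical helper name and a deliberately reduced `index.json` (the real spec has more fields):

```python
import io
import json
import os
import tarfile

def make_conda_pkg(name, version, files, out_dir):
    """Pack `files` (archive path -> bytes) plus two core metadata
    entries (info/index.json and info/files) into a conda-style
    <name>-<version>-0.tar.bz2. Illustrative only, not the full spec."""
    index = {
        "name": name,
        "version": version,
        "build": "0",
        "build_number": 0,
        "depends": [],
    }
    pkg_path = os.path.join(out_dir, "%s-%s-0.tar.bz2" % (name, version))
    with tarfile.open(pkg_path, "w:bz2") as tar:
        def add(arcname, data):
            info = tarfile.TarInfo(arcname)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
        # payload files first, then the metadata describing them
        for path, data in sorted(files.items()):
            add(path, data)
        add("info/index.json", json.dumps(index).encode())
        add("info/files", "\n".join(sorted(files)).encode())
    return pkg_path
```

In this picture, hashdist would walk each build artifact directory and hand its file list to something like the helper above.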

Aron Ahmadia

Feb 28, 2017, 5:23:53 PM
to hashdist
I can also provide some support in navigating the code base. I think there is general agreement that Hashdist as a package builder is the right metaphor. It might even be possible to design a bridging plugin to conda-build that allows it to use hashdist under the hood to manage more complex builds. Push more of the complexity out of the profiles and into the package specifications.

I don't have a great use case right now for hashdist (or even conda build), but I'm planning to be at scipy this year if that's a convenient place for folks to get together.

A
--
You received this message because you are subscribed to the Google Groups "hashdist" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hashdist+u...@googlegroups.com.
To post to this group, send email to hash...@googlegroups.com.
Visit this group at https://groups.google.com/group/hashdist.
To view this discussion on the web visit https://groups.google.com/d/msgid/hashdist/8f813370-2c70-4be3-8b72-432d30a5342f%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Dag Sverre Seljebotn

Mar 1, 2017, 8:54:53 AM
to hash...@googlegroups.com, Chris Kees, Ondřej Čertík
Sorry for being so unresponsive to you all lately; I have some more
time to at least read my email better and participate in discussions
over the coming couple of months.

So I see two things getting mentioned here:

1) Make Hashdist as-is as user-friendly as conda

2) Use Hashdist to build (ana)conda packages that can be plugged into
the (ana)conda stack.

I don't have strong opinions on either of these. 1) would be cool in
itself but I agree with not taking that route if it is possible to
leverage conda.

The question I have though is how to resolve dependencies. One reason
Hashdist works well is that you do in fact control all the dependencies
fully (in hashstack) -- what happens if you simply use it to build conda
binaries? Won't you get package collisions? (I just tried to install
"pyfits" in conda, which wasn't in the standard anaconda channels so
that I had to use a custom channel, and couldn't figure out how to do it
without package conflicts, and had to "pip install" it)

Solution 1)

If you emit one package per Hashdist hash, so you get
"numpy-hashstack-12312qss12", then you basically won't use anything of
anaconda. So you only use "conda". I am not clear about what exactly the
advantages are as you make a chasm between the Hashstack users and the
Anaconda users. But if it means we can cut out part of Hashdist (the
profile building part) it could be a win code-wise to cut down what is
"special" for hashdist, but there would still be a Hashstack vs Anaconda

Solution 2)

Add conda metadata manually ("numpy", tagged "hashstack"). But I feel
Hashdist is overkill for this then -- if most of your dependencies are in
the conda ecosystem already, and you produce a conda product, and you
have to resolve conflicts and incompatibilities manually instead of
having automatic rebuilds, isn't a Makefile or shell script for your
package sufficient?

Perhaps a tool to automate conda package builds is needed, but I feel
Hashdist to be kind of overkill for that. Perhaps gut out parts of
Hashdist and just keep the Hashstack spec format and build Conda
packages from it (but include conda metadata in the Hashstack spec
files)? I could get behind that, but if we go that way I think it's
better to embrace it than make some hybrid that still uses hash-based
builds, then ditches the hashes right before they become useful.

Solution 3)

Somehow convince Continuum to make Hashstack == Anaconda, so that we
have a more open source build of Anaconda, and port their Anaconda build
over.

Solution 4)

Use hash-based packages for stuff Hashdist builds, but be smart about
"aliasing" certain builds to Anaconda packages, so that if you build
NumPy with a given set of flags the hash portion is stripped and you end
up with only "numpy".


Dag Sverre

Ondřej Čertík

Mar 1, 2017, 10:47:29 AM
to hash...@googlegroups.com
On Tue, Feb 28, 2017 at 2:22 PM, Volker Braun <vbrau...@gmail.com> wrote:
> On Tuesday, February 28, 2017 at 7:13:56 PM UTC+1, ondrej.certik wrote:
>>
>> [...] the development profile in Hashdist
>> allows to install custom things into it, Conda handles it in similar
>> way. Neither allows modification of the installed locations of the
>> individual packages, as far as I know.
>
>
> I'd say the main difference is that hashdist uses symlinks so programs
> essentially run from ~/.hashdist while conda copies binaries (and patches
> paths in the binaries). This makes conda profiles simpler and more flexible,
> e.g. conda has binary packages and you can modify installed files as you
> want. I'm aware that this is not entirely clear-cut, hashdist has the
> development profile and conda can also use links. Still, hashdist doesn't
> work on (at least some) ARM boxes because patchelf segfaults (fragile
> alternative to just patching binaries) and e.g. "./default/bin/pip install
> --upgrade pip" dies with an "OSError: Cannot call rmtree on a symbolic link"
> even for a development profile.

That's exactly right. Conda, in my experience, has a much longer track
record of handling binary packages, and I think they really figured it
out, robustly and on all platforms. It's really quite non-trivial
(lots of things to take care of), and I just want to use it, and if
there are any bugs (I don't expect many), just fix them in Conda.

>
> The simplest "conda integration" would be just to pad the build paths in
> ~/.hashdist and then tar.bz2 up the artifacts with some metadata files
> (https://conda.io/docs/spec.html). Just have hashdist spit out a bunch of
> conda packages!

One problem with this is that Conda also handles the path relocation,
and if we use conda to actually build the package, I think it more or
less takes care of it, while if we build it ourselves, we have to make
sure we do it right. Currently the Hashstack is not relocatable, so
that's why I was thinking that just using hashdist to spit out Conda
source packages, and using conda to build them (and do the right
thing), might be easier.
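For reference, the relocation trick being discussed (and what "padding the build paths" enables) is roughly: build against a deliberately long dummy prefix, then at install time rewrite each embedded path string in place, NUL-padding it back to its original length so no byte offsets in the binary shift. A sketch of that replacement step only (function name hypothetical; real relocation also treats text files and RPATHs separately):

```python
def relocate(blob: bytes, build_prefix: bytes, install_prefix: bytes) -> bytes:
    """Rewrite embedded C strings containing build_prefix, NUL-padding
    each rewritten string back to its original length so the file size
    and all later byte offsets stay unchanged."""
    if len(install_prefix) > len(build_prefix):
        raise ValueError("install prefix longer than the padded build prefix")
    out = bytearray(blob)
    i = out.find(build_prefix)
    while i != -1:
        end = out.find(b"\0", i)      # end of the embedded C string
        if end == -1:                 # string runs to end of blob
            end = len(out)
        fixed = bytes(out[i:end]).replace(build_prefix, install_prefix)
        out[i:end] = fixed.ljust(end - i, b"\0")  # NUL-pad: same length
        i = out.find(build_prefix, i + 1)
    return bytes(out)
```

This only works if the build prefix was padded to be at least as long as any plausible install prefix, which is exactly why the builds would use a long dummy path.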

My reasoning is super simple: our manpower in hashdist is limited, but
Conda's community is huge => Use Conda for everything we can, and
offload / take advantage of the Conda ecosystem/community whenever we
can. And use hashdist manpower on things that Conda doesn't do, and as
far as I can see, that's really only the source management with the
yaml profiles etc. But once hashdist figures out the bash script to
build it, I say let Conda take it from there.

Ondrej

Ondřej Čertík

Mar 1, 2017, 12:58:36 PM
to Dag Sverre Seljebotn, hash...@googlegroups.com, Chris Kees
On Wed, Mar 1, 2017 at 6:54 AM, Dag Sverre Seljebotn
<d.s.se...@astro.uio.no> wrote:
> Sorry for being so unresponsive to you all lately, I have some more time to
> at least read my email better and participate in discussions over the coming
> couple of months.
>
> So I see two things getting mentioned here:
>
> 1) Make Hashdist as-is as user-friendly as conda
>
> 2) Use Hashdist to build (ana)conda packages that can be plugged into the
> (ana)conda stack.
>
> I don't have strong opinions on either of these. 1) would be cool in itself
> but I agree with not taking that route if it is possible to leverage conda.

And in practice: we tried to make hashdist user-friendly, but we just
don't have the manpower. To get the manpower, we need to become a
successful open source project with lots of users and community. We
tried to do everything ourselves (i.e. no Conda), and it's a lot of
work and we didn't manage to get the ball rolling in terms of lots of
users and thus contributors. But with Conda, I think we can actually
get there.

>
> The question I have though is how to resolve dependencies. One reason
> Hashdist works well is that you do in fact control all the dependencies
> fully (in hashstack) -- what happens if you simply use it to build conda
> binaries? Won't you get package collisions? (I just tried to install
> "pyfits" in conda, which wasn't in the standard anaconda channels so that I
> had to use a custom channel, and couldn't figure out how to do it without
> package conflicts, and had to "pip install" it)
>
> Solution 1)
>
> If you emit one package per Hashdist hash, so you get
> "numpy-hashstack-12312qss12", then you basically won't use anything of
> anaconda. So you only use "conda". I am not clear about what exactly the

That's right. In fact, the hashdist hash will not be part of the name,
but rather part of the version in Conda --- Conda has a tag (I forgot
their exact term for it), where you can stick the hashdist hash and it
will just work.

> advantages are as you make a chasm between the Hashstack users and the
> Anaconda users. But if it means we can cut out part of Hashdist (the profile
> building part) it could be a win code-wise to cut down what is "special" for
> hashdist, but there would still be a Hashstack vs Anaconda

Initially we only use Conda to handle the binaries, but the stack is
exactly as it is currently in Hashstack --- all packages are ours, we
are not leveraging anybody else's packages. It's my understanding that
Conda already has tons of conflicting packages, e.g. Anaconda,
conda-forge, bioconda, ... you name it. So there is no problem by
adding Hashstack packages into the mix.

I am thinking of using hit to produce the source packages, using
conda-build to build from there, and using conda to manage the
profiles directly. But we can in principle wrap the conda commands in
hit, so you only use hit, and you don't need to know there is conda
underneath. That's just a tiny wrapper, so I am not worried about that
now.

>
> Solution 2)
>
> Add conda metadata manually ("numpy", tagged "hashstack"). But I feel
> Hashdist is overkill for this then -- if most of your dependencies is in the
> conda ecosystem already, and you produce a conda product, and you have to
> resolve conflicts and incompatibilities manually instead of having automatic
> rebuilds, isn't a Makefile or shell script for your package sufficient?

I wouldn't suggest this.

>
> Perhaps a tool to automate conda package builds is needed, but I feel
> Hashdist to be kind of overkill. Perhaps gut out parts of Hashdist and just
> keep the Hashstack spec format and build Conda packages from it (but include
> conda metadata in the Hashstack spec files)? I could get behind that, but if
> we go that way I think it's better to embrace it than make some hybrid that
> still use hash-based builds, then ditches the hashes right before they
> become useful.

I suggest hashdist spits out the Conda spec + bash script (using our
hashdist+hashstack infrastructure to produce it), and let conda build
it from there.

>
> Solution 3)
>
> Somehow convince Continuum to make Hashstack == Anaconda, so that we have a
> more open source build of Anaconda, and port their Anaconda build over.

Too complicated and unclear --- once we provide proof of concept and
make the above actually work, then we can talk about some tighter
integration. Until then it's premature I think.

>
> Solution 4)
>
> Use hash-based packages for stuff Hashdist builds, but be smart about
> "aliasing" certain builds to Anaconda packages, so that if you build NumPy
> with a given set of flags the hash portion is stripped and you end up with
> only "numpy".

I am not sure I follow here. I would do what I proposed above, which
will get us something that works, and then we will improve upon it as
we see fit. We can revisit the discussion then.

Ondrej

Dag Sverre Seljebotn

Mar 1, 2017, 1:37:49 PM
to hash...@googlegroups.com, Ondřej Čertík
FWIW I'm generally +1 on what you say, Ondrej. Is this where
consensus is at now with this project?

Comments:

On 01. mars 2017 18:58, Ondřej Čertík wrote:
> On Wed, Mar 1, 2017 at 6:54 AM, Dag Sverre Seljebotn
>>
>> The question I have though is how to resolve dependencies. One reason
>> Hashdist works well is that you do in fact control all the dependencies
>> fully (in hashstack) -- what happens if you simply use it to build conda
>> binaries? Won't you get package collisions? (I just tried to install
>> "pyfits" in conda, which wasn't in the standard anaconda channels so that I
>> had to use a custom channel, and couldn't figure out how to do it without
>> package conflicts, and had to "pip install" it)
>>
>> Solution 1)
>>
>> If you emit one package per Hashdist hash, so you get
>> "numpy-hashstack-12312qss12", then you basically won't use anything of
>> anaconda. So you only use "conda". I am not clear about what exactly the
>
> That's right. In fact, the hashdist hash will not be part of the name,
> but rather part of the version in Conda --- Conda has a tag (I forgot
> their exact term for it), where you can stick the hashdist hash and it
> will just work.

OK, and it's possible to specify a strong requirement -- if X depends on
NumPy with hash x34asds from HashStack, it won't ever get substituted
with another NumPy with the same major and minor version from another
channel?


>> advantages are as you make a chasm between the Hashstack users and the
>> Anaconda users. But if it means we can cut out part of Hashdist (the profile
>> building part) it could be a win code-wise to cut down what is "special" for
>> hashdist, but there would still be a Hashstack vs Anaconda
>
> Initially we only use Conda to handle the binaries, but the stack is
> exactly as it is currently in Hashstack --- all packages are ours, we
> are not leveraging anybody's else packages. It's my understanding that
> Conda already has tons of conflicting packages, e.g. Anaconda,
> conda-forge, bioconda, ... you name it. So there is no problem by
> adding Hashstack packages into the mix.

There's a big difference in how you brand and announce it. If you tell me
that "Hashdist now builds conda packages" I'm gonna feel I can take a
single package and mix it into my Anaconda stack. So we need to spin it
more like "Hashstack profiles are now managed using the conda tool" or
people will be disappointed / get conflicts and trouble.

>> Solution 4)
>>
>> Use hash-based packages for stuff Hashdist builds, but be smart about
>> "aliasing" certain builds to Anaconda packages, so that if you build NumPy
>> with a given set of flags the hash portion is stripped and you end up with
>> only "numpy".
>
> I am not sure I follow here. I would do what I proposed above, which
> will get us something that works, and then we will improve upon it as
> we see fit. We can revisit the discussion then.

I guess what I mean is simply that at some point (in the future!)
one could start playing with loosening the requirements so that a
package can depend on *any* NumPy 1.12.x, instead of
NumPy-with-hash-aaw3rwa2f3wa. This could take the form of adding
run-time as opposed to build-time dependency metadata to packages, but
as you say, it is something to think about later. Still, it's another
reason to start using conda, I guess, that it opens the door for this.

Dag Sverre

Michael Sarahan

Mar 1, 2017, 1:52:51 PM
to hashdist, ondrej...@gmail.com
Greetings all from the Conda team.  I'm the conda-build guy right now.  I think it's worth adding that conda-build has just added a (massive) PR that brings it much more in line with what HashDist does (I think): https://github.com/conda/conda-build/pull/1585

Every conda package will now also have an associated hash.  Your version number can be a hash itself, but I'd only actually recommend that if your package is actually versioned using a hash - let conda-build use its hash for uniqueness, and perhaps store HashDist's hash in the recipe metadata for future reference (the extra section would be good for this).  Dependencies can be pinned exactly, or allowed to vary based on version numbering.  I think to start, hashdist would produce conda recipes that are exactly pinned.
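As a strawman for what "hashdist generates exactly pinned conda recipes" could look like, here is a hypothetical generator that renders a meta.yaml with every dependency pinned to an exact version/build string and the hashdist hash stashed in the `extra` section, as Michael suggests (the `hashdist_hash` key and all field choices are illustrative, not an established convention):

```python
def conda_recipe(name, version, hashdist_hash, pinned_deps):
    """Render a minimal meta.yaml string. `pinned_deps` maps a package
    name to an exact "version build" pin, used for both build and run
    requirements so nothing can be substituted by the solver."""
    deps = "\n".join("    - %s %s" % (d, v) for d, v in sorted(pinned_deps.items()))
    return (
        "package:\n"
        "  name: %s\n"
        '  version: "%s"\n'
        "requirements:\n"
        "  build:\n%s\n"
        "  run:\n%s\n"
        "extra:\n"
        "  hashdist_hash: %s\n" % (name, version, deps, deps, hashdist_hash)
    )

print(conda_recipe("numpy", "1.12.0", "12312qss12", {"python": "2.7.13 0"}))
```

The `extra` section is free-form metadata in conda-build, so the hashdist hash rides along for provenance while conda-build computes its own hash for uniqueness.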

Looking forward to working with you all on this.

Michael

Ondřej Čertík

Mar 1, 2017, 2:23:25 PM
to Dag Sverre Seljebotn, hash...@googlegroups.com
On Wed, Mar 1, 2017 at 11:37 AM, Dag Sverre Seljebotn
<d.s.se...@astro.uio.no> wrote:
> FWIW I'm generally +1 on what you say, Ondrej. Is this where consensus is at
> now with this project?

I agree with the rest of your comments. Yes, it is possible to pin the
dependency exactly, using a hash or otherwise.

Regarding the consensus, Aron is the one who had the idea; he
convinced me, and since then I am convinced this is the way to go.
Chris is, I think, open to the idea in general, and it looks like
Volker is too.

Regarding how we announce it, I agree 100%. My strategy is to keep
this low profile, see how it works, if it can deliver, and once we get
our hands dirty and have something that works, we can get more people
to try it out, etc., and then we can figure out what we want to do --
whether to wrap Conda and pretend it doesn't exist, i.e. that Hashdist
is the main thing and Conda is just an implementation detail, or the
other way round --- that Conda solves the binary problem and thus
should be used, and hashdist solves the source problem by providing
the packages. Or some mix in between. I don't have a strong opinion
either way. Then we figure out how to market it in our documentation
and main page, so that people get the right idea.

Ondrej

Ondřej Čertík

Mar 1, 2017, 2:36:08 PM
to Michael Sarahan, hashdist
Hi Michael,

On Wed, Mar 1, 2017 at 11:52 AM, Michael Sarahan <msar...@gmail.com> wrote:
> Greetings all from the Conda team. I'm the conda-build guy right now. I
> think it's worth adding that conda-build has just added a (massive) PR that
> brings it much more in line with what HashDist does (I think):
> https://github.com/conda/conda-build/pull/1585

Thanks for the email. Is your job what Aaron Meurer was doing before?

>
> Every conda package will now also have an associated hash. Your version
> number can be a hash itself, but I'd only actually recommend that if your
> package is actually versioned using a hash - let conda-build use its hash
> for uniqueness, and perhaps store HashDist's hash in the recipe metadata for
> future reference (the extra section would be good for this). Dependencies
> can be pinned exactly, or allowed to vary based on version numbering. I
> think to start, hashdist would produce conda recipes that are exactly
> pinned.

That looks good, it's great that Conda got support for using hashes as well.

What do you think should be our first step? I was thinking the following:

* take some simple hashdist profile with a few (say 3) packages
* make hashdist generate the directory structure + Conda specs for these
3 packages
* call conda-build to build them.

How do you make conda-build handle dependencies between source
packages? I read this:

https://conda.io/docs/building/recipe.html

but I am still a bit confused.

As an example, say the hashdist profile contains three packages A, B,
and C, where A and B are independent but C depends on both A and B.
Do I have to call conda-build manually in the directory for A, then
the directory for B, and finally in the directory for C (and conda
will somehow know about A and B)?

Ondrej

Gamblin, Todd

Mar 1, 2017, 2:48:56 PM
to hash...@googlegroups.com, Dag Sverre Seljebotn, Chris Kees
Hi all,

I wanted to chime in with a few points, and see if I can be convincing.  I'm the creator/lead developer of Spack (https://github.com/LLNL/spack), and we've had some prior discussions on this list comparing hashdist to Spack.  I know some of you IRL, but I've never met Dag or Chris, so hello!

Spack *has* an open source community, and it’s focused on HPC.  We have managed to convince most of the large HPC centers in DOE to get on board, and many are contributing.  We are also getting contributions from many HPC centers in academia and outside the US (CERN, Fermi, NASA-GISS, EPFL, BSC, etc.)  See slide 23 here for how contributions have grown, and from where:


and here for a measure of the level of contributions to the project:


Chris wrote:

What drove the development of hashdist was the need for customized, reproducible builds of scientific Python stacks, with many non-Python dependencies and specifically vendor/host dependencies. In particular, supporting those builds in a platform-independent way across machines in a _slightly_ challenging HPC environment. That need still exists and hasn’t been addressed well by other tools that I know of.

Spack had similar motivations, perhaps without the focus on Python.  I would say we wanted to build large codes in a __very__ challenging (GPU/xeon phi/cray/power8/etc.) HPC environment.

Spack also already provides *many* of the features on Chris’s feature list, as well as many features that neither hashdist nor Conda have that are good for HPC. These have helped us to grow adoption:

1. compiler provenance
2. compiler swapping
3. support for cross-compiled platforms (platforms can have multiple OS/target combinations)
4. generating modules (lmod & tcl)
5. using vendor packages that are only provided through modules (e.g. on Cray)
6. multiple dependency types: build/link/run
- allows, e.g., front-end build deps to be built for the host and not target on cross-compiled machines
7. virtual dependencies, build variants, and a dependency system that supports them
- swappable MPI, swappable BLAS/LAPACK/SCALAPACK versions
8. mirroring packages over an air gap
9. A query interface, a syntax, and a dependency model for all of that.

At the dependency level, Spack is not so different from hashdist.  We have full control over the dependencies, like hashdist, but to some extent we have more control because we support more parameters (compilers, variants, versions, etc.), and we allow dependencies on builds with specific parameters.  Like hashdist, we use hashes to identify builds in a combinatorial build space, and you can refer to builds by hash, but you can also refer to builds and query them based on their build parameters through the UI.

Where we are lacking is in profile/environment support and in binary packaging.  These are things that, IMHO, are not as difficult to add as Ondrej thinks, and I believe that it’s far more important in the HPC environment to keep the provenance all the way through to the binary installation.  Per Dag:

The question I have though is how to resolve dependencies.

In Conda, you’d have an opaque hash, and you have a non-parameterized dependency system (except by version).  In Spack we can look at the metadata for an installation and tell whether a binary (and its dependencies) there can be substituted into a particular DAG we want to build.  We are also planning to add more sophisticated capabilities to the dependency system, like depending on particular language levels (C++11, C++14, OpenMP 4+, etc.) and satisfying *those* dependencies with a suitable compiler.  We have a prototype binary packager that actually works.  We need to add security (basically signing of packages).

Ondrej wrote:
To get the manpower, we need to become a successful open source project with lots of users and community. 

I agree that Conda is gaining in popularity, but I am not at all convinced that it is the right solution for HPC, or for a lot of the things hashdist is capable of doing.  I am also not convinced that having two tools is going to result in a great user experience or encourage the adoption you want.  I suspect, based on my experience with my users, that they won’t like it.

In addition to the community, we have people dedicated to Spack here at LLNL (~2 FTE), and we’ve actively worked to grow the community and to integrate external contributions.  We have buy-in from DOE HPC centers, and they actually contribute.  NERSC has a dedicated person who works with us, and we get tons of contributions from ANL, and to a lesser extent ORNL.  ARM uses Spack for their compiler regression suite, and we’ve started working with Intel to port a number of ML libraries to use Spack.  IBM is also contributing, as part of the CORAL procurement.  We also have buy-in from a number of HPC app developers to package their stacks using Spack, via ECP.

So my question is, what would we need to do or add to get the hashdist folks on board with Spack?  Or are you sold on the hashdist/conda combination?  I’m not opposed to adding things to Spack to suit the needs of other communities, and we would love to have the many smart folks on this list as contributors/peers on a shared project.  I think we could make a better package manager that way.

-Todd

Michael Sarahan

Mar 1, 2017, 3:07:48 PM
to Ondřej Čertík, hashdist
Hi Ondrej,

On Wed, Mar 1, 2017 at 1:36 PM, Ondřej Čertík <ondrej...@gmail.com> wrote:
Hi Michael,

On Wed, Mar 1, 2017 at 11:52 AM, Michael Sarahan <msar...@gmail.com> wrote:
> Greetings all from the Conda team.  I'm the conda-build guy right now.  I
> think it's worth adding that conda-build has just added a (massive) PR that
> brings it much more in line with what HashDist does (I think):
> https://github.com/conda/conda-build/pull/1585

Thanks for the email. Is your job what Aaron Meurer was doing before?


Kind of?  I'm more strictly focused on build concerns than Aaron was.  I don't do much with conda proper, but I'm the primary developer of conda-build, and I'm also working on a package CI system.
 
>
> Every conda package will now also have an associated hash.  Your version
> number can be a hash itself, but I'd only actually recommend that if your
> package is actually versioned using a hash - let conda-build use its hash
> for uniqueness, and perhaps store HashDist's hash in the recipe metadata for
> future reference (the extra section would be good for this).  Dependencies
> can be pinned exactly, or allowed to vary based on version numbering.  I
> think to start, hashdist would produce conda recipes that are exactly
> pinned.

That looks good, it's great that Conda got support for using hashes as well.

What do you think should be our first step? I was thinking the following:

* take some simple hashdist profile with a few (say 3) packages
* have hashdist generate the directory structure + Conda specs for these 3 packages
* let conda-build build from it.
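To make this concrete, a generated recipe might look roughly like this. The package, version, and pins below are purely illustrative, and the `extra` section is where HashDist's own hash could be stored, as suggested above:

```yaml
# meta.yaml -- a hypothetical recipe a hashdist exporter might emit
# (names, versions, and hashes are illustrative only)
package:
  name: zlib
  version: "1.2.11"

source:
  url: https://zlib.net/zlib-1.2.11.tar.gz

requirements:
  build: []
  run:
    # exact pins, mirroring hashdist's fully resolved dependency set
    - somepkg 1.0 0        # name, version, and build string all pinned

extra:
  hashdist_hash: "placeholder"   # room for hashdist's own artifact hash
```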

This seems like a very reasonable approach.
 

How do you make conda-build handle dependencies between source
packages? I read this:

https://conda.io/docs/building/recipe.html

There's build-time dependencies and run-time dependencies.  The best reference, by far, is https://conda.io/docs/building/meta-yaml.html - sorry our docs are unintuitive.  I often have to resort to using Google to find anything in them. =(

There's also new docs in a PR at https://github.com/conda/conda-docs/pull/414 - the PR I mentioned last time adds a lot of flexibility in pinning.  Basically, it adds tools that express things like "pin the runtime version of 'a' to the major/minor version of whatever version of 'a' was installed at build time" and "pin the runtime version of 'a' to exactly (hash and everything) the version of 'a' that was installed at build time."  It also provides a very flexible way of specifying which versions to install at build time: the conda_build_config.yaml file, plus jinja2 templates in meta.yaml to use those values.
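As a sketch of how the two pieces fit together (assuming the variant key is named `numpy`; consult the docs PR above for the authoritative syntax):

```yaml
# conda_build_config.yaml -- versions to install at build time
numpy:
  - "1.11"

# meta.yaml (excerpt) -- jinja2 templates pick up the values above
requirements:
  build:
    - numpy {{ numpy }}
  run:
    # pin the runtime version to the major/minor of the build-time one
    - {{ pin_compatible('numpy') }}
```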
 
Conda-build has a naive recursive approach to building dependencies.  If you try to build C, and A and B are not available, it will go back and try to build A and B first.  It's a clumsy way of building the DAG as you go.  There's an actual DAG in our CI system that I hope to port to conda-build someday: https://github.com/conda/conda-concourse-ci/blob/master/conda_concourse_ci/compute_build_graph.py#L136 - it has roughly the same approach, but is less "try and fail" and more "let me ask if I can actually do this before I try to do it."
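The "actual DAG" approach boils down to a topological sort over the package dependency graph. A minimal illustrative sketch (Kahn's algorithm; not code from conda-build or conda-concourse-ci):

```python
from collections import deque

def build_order(deps):
    """Return a build order for a package dependency graph.

    deps maps each package name to the set of package names it depends
    on; every dependency is assumed to appear as a key as well.
    Raises ValueError if the graph contains a cycle.
    """
    # Track unresolved dependencies for each package.
    pending = {pkg: set(d) for pkg, d in deps.items()}
    # Packages with no unbuilt dependencies can be built first.
    ready = deque(sorted(p for p, d in pending.items() if not d))
    order = []
    while ready:
        pkg = ready.popleft()
        order.append(pkg)
        for other, d in pending.items():
            if pkg in d:
                d.remove(pkg)
                if not d:
                    ready.append(other)
    if len(order) != len(pending):
        raise ValueError("dependency cycle detected")
    return order
```

For example, `build_order({"c": {"a", "b"}, "a": set(), "b": {"a"}})` builds `a` first, then `b`, then `c`; the "ask before trying" part is that a cycle is reported up front instead of discovered through failed builds.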

Michael

Ondřej Čertík

unread,
Mar 1, 2017, 3:16:47 PM3/1/17
to hash...@googlegroups.com, Dag Sverre Seljebotn, Chris Kees
Hi Todd,

On Wed, Mar 1, 2017 at 12:48 PM, Gamblin, Todd <gamb...@llnl.gov> wrote:
> Hi all,
>
> I wanted to chime in with a few points, and see if I can be convincing. I’m
> the creator/lead developer of Spack (https://github.com/LLNL/spack), and
> we’ve had some prior discussions on this list comparing hashdist to Spack. I
> know some of you IRL, but I’ve never met Dag or Chris, so hello!
>
> Spack *has* an open source community, and it’s focused on HPC. We have
> managed to convince most of the large HPC centers in DOE to get on board,
> and many are contributing. We are also getting contributions from many HPC
> centers in academia and outside the US (CERN, Fermi, NASA-GISS, EPFL, BSC,
> etc.) See slide 23 here for how contributions have grown, and from where:
>
> https://spack.io/slides/Spack-ECP-Exascale-Package-Manager.pdf

I saw your presentation at ECP in Knoxville. Great presentation, and
from talking to people there, I think you've made it as an open
source project: you have the minimal community to keep the ball
rolling, as well as enough HPC users and companies using Spack. Sorry I
didn't manage to find you at the conference to talk.
Yes, these features are impressive.
So this is just my own motivation, others might have different ones:

* I want both HPC *and* desktop development, and I want to use the same
tool. Conda works on Linux, Mac and Windows, as well as HPC. My
understanding is that Spack doesn't work on Windows, and I don't know
how important it is to you to support end users the way Conda does.

* I don't like your license; I wish you would switch to BSD or MIT (I
think I've mentioned it a couple of times, both in person and on the
mailing list)


I need to test Spack again to see if it works for me. I'll write more
after I play with the latest Spack. It has improved a lot since the
last time I played with it.

Ondrej

Gamblin, Todd

unread,
Mar 1, 2017, 3:50:10 PM3/1/17
to hash...@googlegroups.com, Dag Sverre Seljebotn, Chris Kees
On Mar 1, 2017, at 12:16 PM, Ondřej Čertík <ondrej...@gmail.com> wrote:

Hi Todd,

On Wed, Mar 1, 2017 at 12:48 PM, Gamblin, Todd <gamb...@llnl.gov> wrote:

Spack *has* an open source community, and it’s focused on HPC.  We have
managed to convince most of the large HPC centers in DOE to get on board,
and many are contributing.  We are also getting contributions from many HPC
centers in academia and outside the US (CERN, Fermi, NASA-GISS, EPFL, BSC,
etc.)  See slide 23 here for how contributions have grown, and from where:

https://spack.io/slides/Spack-ECP-Exascale-Package-Manager.pdf

I saw your presentation at ECP in Knoxville. Great presentation, and
by talking to people there, I think that you made it, as an open
source project, I feel you got the minimal community and the ball
rolling, as well enough HPC users and companies to use Spack. Sorry I
didn't manage to find you at the conference to talk.

Yeah, sorry I didn’t find time to talk at the meeting, either.  I saw you but didn’t manage to catch you as I ran around to various meetings.  I am attempting to sell the ECP folks on a) sensible deployment and b) CI.  It’s a project :).  FWIW, I want (a) to be architecture-specific binaries (e.g., optimized haswell, power8le, ARMv7, etc. builds).

So my question is, what would we need to do or add to get the hashdist folks
on board with Spack?  Or are you sold on the hashdist/conda combination?
I’m not opposed to adding things to Spack to suit the needs of other
communities, and we would love to have the many smart folks on this list as
contributors/peers on a shared project.  I think we could make a better
package manager that way.

So this is just my own motivation, others might have different ones:

I’d be very curious about what others think, too :).  I can also point out that at LLNL, I cannot guarantee, but I can argue for subcontracts if hashdist folks would want to build their favorite features into Spack.  I have, many times, when short on cycles, thought about calling up Dag to see if he’d be interested (I know he was at least somewhat keen on an embedded Python DSL as opposed to YAML).

* I want both HPC *and* desktop development and I want to use the same
tool. Conda works on Linux, Mac and Windows, as well as HPC. My
understanding is that Spack doesn't work on Windows and I don't know
how important is to support the usage of Spack for the end user, just
like Conda works.

I think this is important, as well.  We have developers here who use macs, and many of our users do development on their macs.  We have paid Kitware to get Qt5 working on macs, and we actually support a mocked-up XCode build there because Qt is such a bear.  Perhaps I shouldn’t call it the “supercomputing” package manager, as that scares people away. 

w.r.t. Windows, we have at least one major code team with many DOD customers who use Windows, so I would honestly like to see it work on Windows, too.  We have removed one major barrier: the use of fork(); we switched to vanilla subprocess.  The only other thing we’d need to figure out is what to do with symlinks.  There are a few places they’re used, but AFAIK they can be replaced with copying or hard links if we have to do that.  I thought hit used symlinks — am I wrong?  What does it do on Windows?  I have not delved into it for a while.

* I don't like your license, I wish you switched to a BSD or MIT (I
think I mentioned it couple times, both in person and on a
mailinglist)

If it would get you guys on board, I would be completely willing to change the license.  I’m not particularly religious about this, and I have thought about it a number of times. At this point we would need to do some remediation.  LLNL lacks a non-cumbersome CLA process, so we don’t have one on Spack. We would need to ask people to get on board.  If some don’t, we’d need to rewrite some parts of the code, but most contributed code is in packages, not in the core.  Also, LGPL packages could be used even if the base was BSD/MIT, if someone objected for whatever reason.

Question, though, what don’t you like about LGPL for a package manager written in Python?  It’s compatible with BSD/MIT and vice versa, and to distribute it, you need to send along the source anyway.  You can still have external package repos that are not “in” spack.  Is there something proprietary you want to do with the Spack core?  I guess I don’t see much functional difference between LGPL and MIT for this use case.  But like I said, I’m willing to try.

I need to test Spack again to see if it works for me. I'll write more
after I play with the latest Spack. It has improved a lot since the
last time I played with it.

I think you will find the env support lacking, but that is something we have an active WG for, and I think it’s the last piece for a really good user experience.  I am working on better dependency resolution (SAT solving on all our parameters) at the moment, which we need to support stateful installs into an existing environment.  But we have much better support for external packages since last you looked:


And we have some pretty cool interfaces for passing information between builds within Spack, some of which go in the same direction as currently open hashdist HEPs.


TL; DR, let me know what you and others want.  I would be willing to work pretty hard to get your expertise and to get hashdist folks as contributors.

-Todd

Ondřej Čertík

unread,
Mar 1, 2017, 6:22:44 PM3/1/17
to hash...@googlegroups.com, Dag Sverre Seljebotn, Chris Kees
I can answer this one quickly --- if I like your Paraview build, and
want to copy how you do it into Hashdist, I would have to change the
license of the Hashdist paraview spec file to LGPL. If, on the other
hand, you used BSD or MIT, all we have to do is to copy the Spack's
BSD or MIT license somewhere in the Hashdist documentation, and we can
freely copy any code we want. The same if I wanted to use Spack for
some Conda packages. The same if I implement some feature into Spack,
and then wanted to use the same code in Hashdist. And so on, there are
endless combinations, where I would be forced to switch the given file
to LGPL.

I'll answer the rest later. I just want to say thanks for your
willingness to work with us.

Ondrej

Ondřej Čertík

unread,
Mar 1, 2017, 6:52:29 PM3/1/17
to hash...@googlegroups.com, Dag Sverre Seljebotn, Chris Kees
Here are the first few issues I hit:

https://github.com/LLNL/spack/issues/3295
https://github.com/LLNL/spack/issues/3296

Especially the second one is a blocker for me currently, without it I
can't try the libraries/programs that I need.

Ondrej

Ondřej Čertík

unread,
Mar 2, 2017, 1:25:32 AM3/2/17
to hash...@googlegroups.com, Dag Sverre Seljebotn, Chris Kees
Ok, resolved, now I am hitting this one:

https://github.com/LLNL/spack/pull/3298

Sure, I perhaps picked a difficult package (ParaView), but it's
something I need for work right now, and it works with Hashdist, so I
know how to compile it; it's just a matter of getting Spack to do the
right thing.

Ondrej

>
> Ondrej

Gamblin, Todd

unread,
Mar 2, 2017, 4:07:39 AM3/2/17
to hash...@googlegroups.com, Dag Sverre Seljebotn, Chris Kees
I *believe* for this particular case, you’d be ok, as to put it in hashdist you’d convert it from Python to yaml/shell anyway, which I don’t think falls under copyright, as you’ve rewritten what we did in Spack at that point.  It would be hard to say that translating a build script constitutes copying some part of Spack — remember this is copyright, not patents.  IANAL, IANYL, TINLA, etc.

But I can relate, as I had a very similar discussion with the easybuild guys about their code being GPL, so we can’t reuse their builds without having to release Spack as GPL :).

I'll answer the rest later. I just want to say thanks for your
willingness to work with us.

Sure thing!

-Todd

Dag Sverre Seljebotn

unread,
Mar 2, 2017, 4:16:40 AM3/2/17
to Gamblin, Todd, hash...@googlegroups.com, Chris Kees
On 02. mars 2017 10:07, Gamblin, Todd wrote:
> On Mar 1, 2017, at 3:22 PM, Ondřej Čertík <ondrej...@gmail.com
I think a big part of it is simply that with BSD or MIT, you don't have
to think. As soon as you are touched by LGPL, you at least have to
consider what the effects are, or if you are in a company you need to
involve lawyers etc. -- which is a pain regardless of what the legal
answer in the end turns out to be. As you just demonstrated by having to
say IANAL on a simultaneously critical and trivial question.

Dag Sverre

Gamblin, Todd

unread,
Mar 2, 2017, 4:28:39 AM3/2/17
to Dag Sverre Seljebotn, hash...@googlegroups.com, Chris Kees
On Mar 2, 2017, at 1:16 AM, Dag Sverre Seljebotn <d.s.se...@astro.uio.no> wrote:

I can answer this one quickly --- if I like your Paraview build, and
want to copy how you do it into Hashdist, I would have to change the
license of the Hashdist paraview spec file to LGPL. If, on the other
hand, you used BSD or MIT, all we have to do is to copy the Spack's
BSD or MIT license somewhere in the Hashdist documentation, and we can
freely copy any code we want. The same if I wanted to use Spack for
some Conda packages. The same if I implement some feature into Spack,
and then wanted to use the same code in Hashdist. And so on, there are
endless combinations, where I would be forced to switch the given file
to LGPL.

I *believe* for this particular case, you’d be ok, as to put it in
hashdist you’d convert it from Python to yaml/shell anyway, which i
don’t think falls under copyright, as you’ve rewritten what we did in
Spack at that point.  It would be hard to say that translating a build
script constitutes copying some part of Spack — remember this is
copyright, not patents.  IANAL, IANYL, TINLA, etc.

But I can relate, as I had a very similar discussion with the easybuild
guys about their code being GPL, so we can’t reuse their builds without
having to release Spack as GPL :).

I think a big part of it is simply that with BSD or MIT, you don't have to think. As soon as you are touched by LGPL, you at least have to consider what the effects are, or if you are in a company you need to involve lawyers etc. -- which is a pain regardless of what the legal answer in the end turns out to be. As you just demonstrated by having to say IANAL on a simultaneously critical and trivial question.

There is some irony here in that one reason we went with LGPL was from prior experiences with some companies, where they had an *easier* time contributing to LGPL projects because they were obligated to do so, whereas with BSD/MIT they had the option to keep things proprietary and therefore had to go through a major review with lawyers. Either way, I’d be ok with a permissive license.  It might also help us down the road if, say, an HPC vendor wanted to use Spack as a deployment tool for their machine.  My view is that would benefit us.

How do you all feel about Apache 2?  LLNL (and various other projects like Kubernetes) are starting to prefer it due to the patent indemnification clause (it is otherwise the same as BSD).  It is not as compatible as BSD/MIT though — in particular it’s not compatible with GPL2.  I would be tempted to push for MIT/BSD because, as you say, I only have so many cycles to think.

-Todd

Ondřej Čertík

unread,
Mar 2, 2017, 10:06:50 AM3/2/17
to hash...@googlegroups.com, Dag Sverre Seljebotn, Chris Kees
On Thu, Mar 2, 2017 at 2:28 AM, Gamblin, Todd <gamb...@llnl.gov> wrote:
> On Mar 2, 2017, at 1:16 AM, Dag Sverre Seljebotn
> <d.s.se...@astro.uio.no> wrote:
>
>
> I can answer this one quickly --- if I like your Paraview build, and
> want to copy how you do it into Hashdist, I would have to change the
> license of the Hashdist paraview spec file to LGPL. If, on the other
> hand, you used BSD or MIT, all we have to do is to copy the Spack's
> BSD or MIT license somewhere in the Hashdist documentation, and we can
> freely copy any code we want. The same if I wanted to use Spack for
> some Conda packages. The same if I implement some feature into Spack,
> and then wanted to use the same code in Hashdist. And so on, there are
> endless combinations, where I would be forced to switch the given file
> to LGPL.
>
>
> I *believe* for this particular case, you’d be ok, as to put it in
> hashdist you’d convert it from Python to yaml/shell anyway, which i
> don’t think falls under copyright, as you’ve rewritten what we did in
> Spack at that point. It would be hard to say that translating a build
> script constitutes copying some part of Spack — remember this is
> copyright, not patents. IANAL, IANYL, TINLA, etc.

Well, that's not what your own license says. ;)

If you look at the text of your license, line 75:

https://github.com/LLNL/spack/blob/88f97c07dea843f2a2c1d87347edccb69c093903/LICENSE#L75

it says very clearly "or *translated* straightforwardly into another language":

"""
A "work based on the
Library" means either the Library or any derivative work under
copyright law: that is to say, a work containing the Library or a
portion of it, either verbatim or with modifications and/or translated
straightforwardly into another language.
"""

and below it establishes that "work based on the Library" must be
licensed as LGPL.


In general, any translation to another language is a derivative work,
that's well known:

http://softwareengineering.stackexchange.com/questions/151515/rewrote-gnu-gpl-v2-code-in-another-language-can-i-change-a-license
http://softwareengineering.stackexchange.com/questions/86754/is-it-possible-to-rewrite-every-line-of-an-open-source-project-in-a-slightly-dif
...

but LGPL actually makes this explicitly clear.

So even if we took Spack code and rewrote it straightforwardly (i.e. line
by line) from Python to C++, it's a derivative work, and thus must be
LGPL-licensed. Now in the case of, say, ParaView, it's even simpler,
because there I would literally take the cmake options and copy &
paste them into Hashdist (perhaps removing some Python syntax), so I
am literally copying the code, and thus LGPL applies.

Btw, this is not some theoretical scenario, I literally copied code
from Hashdist to Spack yesterday here:

https://github.com/LLNL/spack/pull/3298/files#diff-c3aab198d1c00ddc4bdd8fe3d8dfea50R69

Now, since I happened to write that Hashdist line, I own the
copyright, so I can submit it to Spack (it's a little more complicated as
an employee of a corporation, but the idea stands), but if somebody
else wrote it, then we would have to copy the Hashdist BSD license
somewhere. Not a big deal, since it isn't forcing Spack to relicense.
But if I did the opposite, and copied some code from Spack to Hashdist
or Conda, I would have to relicense that file. So that's a problem.

>
> But I can relate, as I had a very similar discussion with the easybuild
> guys about their code being GPL, so we can’t reuse their builds without
> having to release Spack as GPL :).

Indeed you cannot.

>
>
> I think a big part of it is simply that with BSD or MIT, you don't have to
> think. As soon as you are touched by LGPL, you at least have to consider
> what the effects are, or if you are in a company you need to involver
> lawyers etc. -- which is a pain regardless of what the legal answer in the
> end turns out to be. As you just demonstrated by having to say IANAL on a
> simultaneously critical and trivial question.
>
>
> There is some irony.here in that one reason we went with LGPL was from prior
> experiences with some companies, where they had an *easier* time
> contributing to LGPL projects because they were obligated to do so, whereas

Were they redistributing Spack to other people? Because if they only
used it within the company, neither GPL nor LGPL requires them to
contribute back, as they are not distributing the work.

> with BSD/MIT they had the option to keep things proprietary and therefore
> had to go through a major review with lawyers. Either way, I’d be ok with a

I touched on it above --- as an employee of a corporation, it's the
company that owns your work, and by default any contribution back to,
say, Spack requires the legal department to okay it, no matter what
license Spack uses (whether LGPL or BSD). That's where the Contributor
License Agreement (CLA) comes in --- the company, not the person, has
to grant you a copyright license to their work under the Spack's LGPL
or BSD. So if they didn't run it through their legal, they can get in
big trouble.
Most open source projects don't want to bother with CLA, so the
copyright license grant is implicit by submitting a PR.


> permissive license. It might also help us down the road if, say, an HPC
> vendor wanted to use Spack as a deployment tool for their machine. My view
> is that would benefit us.
>
> How do you all feel about Apache 2? LLNL (and various other projects like
> Kubernetes) are starting to prefer it due to the patent indemnification
> clause (it is otherwise the same as BSD). It is not as compatible as
> BSD/MIT though — in particular it’s not compatible with GPL2. I would be
> tempted to push for MIT/BSD because, as you say, I only have so many cycles
> to think.

I am fine with Apache. I still think MIT/BSD is better, but Apache is
a big improvement over LGPL.

Ondrej

>
> -Todd
>

Gamblin, Todd

unread,
Mar 2, 2017, 12:10:16 PM3/2/17
to hash...@googlegroups.com, Dag Sverre Seljebotn, Chris Kees
Ondrej:

On Mar 2, 2017, at 7:06 AM, Ondřej Čertík <ondrej...@gmail.com> wrote:

On Thu, Mar 2, 2017 at 2:28 AM, Gamblin, Todd <gamb...@llnl.gov> wrote:

I *believe* for this particular case, you’d be ok, as to put it in
hashdist you’d convert it from Python to yaml/shell anyway, which i
don’t think falls under copyright, as you’ve rewritten what we did in
Spack at that point.  It would be hard to say that translating a build
script constitutes copying some part of Spack — remember this is
copyright, not patents.  IANAL, IANYL, TINLA, etc.

Well, that's not what your own license says. ;)

If you look at the text of your license, line 75:

https://github.com/LLNL/spack/blob/88f97c07dea843f2a2c1d87347edccb69c093903/LICENSE#L75

it says very clearly "or *translated* straightforwardly into another language”:

True enough.  :(.

So even if we took Spack code and rewrote straightforwardly (i.e. line
by line) from Python to C++, it's a derivative work, and thus must be
LGPL licensed. Now in the case of, say, Paraview, it's even simpler,
because there I would literally take the cmake options and copy &
pasted to Hashdist (perhaps removing some Python syntax), and so then I
am literally copying the code, thus LGPL applies.

Since we’re getting into the details, I will point out that there is an originality requirement in US copyright law (https://en.wikipedia.org/wiki/Threshold_of_originality#United_States).  I don’t actually think any of the build recipes would stand up as more than “mere sweat of the brow” under scrutiny, so I think it would be extremely hard to bring any kind of copyright claim based on a stolen configure line.  But, again, this requires you to think about it, and it’s not the position implied by the license, which is admittedly a pain.

But if I did the opposite, and copied some code from Spack to Hashdist
or Conda, I would have to relicense that file. So that's a problem.

True.

I think a big part of it is simply that with BSD or MIT, you don't have to
think. As soon as you are touched by LGPL, you at least have to consider
what the effects are, or if you are in a company you need to involver
lawyers etc. -- which is a pain regardless of what the legal answer in the
end turns out to be. As you just demonstrated by having to say IANAL on a
simultaneously critical and trivial question.


There is some irony.here in that one reason we went with LGPL was from prior
experiences with some companies, where they had an *easier* time
contributing to LGPL projects because they were obligated to do so, whereas

Were they redistributing spack to other people? Because if they only
used it within the company, neither GPL nor LGPL requires them to
contribute back, as they are not distributing the work.

I think in this case it was just an internal policy of the company — not based on legal requirements.  In practice I’ve found that *most* places are the other way around.

with BSD/MIT they had the option to keep things proprietary and therefore
had to go through a major review with lawyers. Either way, I’d be ok with a

I touched it above --- as an employee of a corporation, it's the
company that owns your work, and by default any contribution back to,
say, Spack requires the legal department to okay it, no matter what
license Spack uses (whether LGPL or BSD). That's where the Contributor
License Agreement (CLA) comes in --- the company, not the person, has
to grant you a copyright license to their work under the Spack's LGPL
or BSD. So if they didn't run it through their legal, they can get in
big trouble.
Most open source projects don't want to bother with CLA, so the
copyright license grant is implicit by submitting a PR.

I’m actually pretty sure this is not true — a PR doesn’t imply copyright transfer, which is why we’d have to get CLAs from people to relicense.  I’m willing to do it, though.  LLNL hasn’t historically had a clear policy on this.  One thing I am fighting for internally is more clarity so that people releasing OSS projects understand their options.

How do you all feel about Apache 2?  LLNL (and various other projects like
Kubernetes) are starting to prefer it due to the patent indemnification
clause (it is otherwise the same as BSD).  It is not as compatible as
BSD/MIT though — in particular it’s not compatible with GPL2.  I would be
tempted to push for MIT/BSD because, as you say, I only have so many cycles
to think.

I am fine with Apache. I still think MIT/BSD is better, but Apache is
a big improvement over LGPL.

Ok, I’ll work on this.  I guess the next question is — is that all you need?  I’d be interested to hear from other hashdist folks.

-Todd




Ondrej

Ondřej Čertík

unread,
Mar 2, 2017, 1:19:33 PM3/2/17
to hash...@googlegroups.com
On Thu, Mar 2, 2017 at 10:10 AM, Gamblin, Todd <gamb...@llnl.gov> wrote:
Ondrej:

On Mar 2, 2017, at 7:06 AM, Ondřej Čertík <ondrej...@gmail.com> wrote:

On Thu, Mar 2, 2017 at 2:28 AM, Gamblin, Todd <gamb...@llnl.gov> wrote:

I *believe* for this particular case, you’d be ok, as to put it in
hashdist you’d convert it from Python to yaml/shell anyway, which i
don’t think falls under copyright, as you’ve rewritten what we did in
Spack at that point.  It would be hard to say that translating a build
script constitutes copying some part of Spack — remember this is
copyright, not patents.  IANAL, IANYL, TINLA, etc.

Well, that's not what your own license says. ;)

If you look at the text of your license, line 75:

https://github.com/LLNL/spack/blob/88f97c07dea843f2a2c1d87347edccb69c093903/LICENSE#L75

it says very clearly "or *translated* straightforwardly into another language”:

True enough.  :(.

So even if we took Spack code and rewrote straightforwardly (i.e. line
by line) from Python to C++, it's a derivative work, and thus must be
LGPL licensed. Now in the case of, say, Paraview, it's even simpler,
because there I would literally take the cmake options and copy &
pasted to Hashdist (perhaps removing some Python syntax), and so then I
am literally copying the code, thus LGPL applies.

Since we’re getting into the details, I will point out that there is an originality requirement in US copyright law (https://en.wikipedia.org/wiki/Threshold_of_originality#United_States).  I don’t actually think any of the build recipes would stand up as more than “mere sweat of the brow” under scrutiny, so I think it would be extremely hard to bring any kind of copyright claim based on a stolen configure line.  But, again, this requires you to think about it, and it’s not the position implied by the license, which is admittedly a pain.

Yes, but for example if I wanted to copy some of your truly original work, say, some code from your compiler wrappers (which is your own original invention as far as I can tell), then the "sweat of the brow" argument doesn't apply.
 

had to go through a major review with lawyers. Either way, I’d be ok with a

I touched it above --- as an employee of a corporation, it's the
company that owns your work, and by default any contribution back to,
say, Spack requires the legal department to okay it, no matter what
license Spack uses (whether LGPL or BSD). That's where the Contributor
License Agreement (CLA) comes in --- the company, not the person, has
to grant you a copyright license to their work under the Spack's LGPL
or BSD. So if they didn't run it through their legal, they can get in
big trouble.
Most open source projects don't want to bother with CLA, so the
copyright license grant is implicit by submitting a PR.

I’m actually pretty sure this is not true — a PR doesn’t imply copyright transfer, which is why we’d have to get CLAs from people to relicense.  I’m willing to do it, though.  LLNL hasn’t historically had a clear policy on this.  One thing I am fighting for internally is more clarity so that people releasing OSS projects understand their options.

Correct, a PR does not mean copyright transfer, which is distinct from granting a copyright license, which is what I wrote. A CLA can do either: for example, the FSF requires copyright transfer, but most CLAs, including Google's I think, just grant a copyright license while the contributor keeps the copyright. The advantage of a copyright transfer for you is that you can relicense; the disadvantage for the contributor is that they can't submit the same code to another project, because they lost the copyright to it. The advantage of granting a copyright license is that the contributor keeps the copyright while Spack still has permission to use the code under Spack's license; the disadvantage is that you can't relicense without permission from all the contributors. As you said.

In most open source projects it is understood that by submitting a PR, the contributor is implicitly giving the project permission to use their contribution under the project's license, while keeping the copyright to it. Submitting a PR is essentially asking the project to merge my code, so I am implicitly giving it permission to use my code (under its license). A CLA makes this explicit, but it's a hassle to manage, so most open source projects don't do CLAs.

 

How do you all feel about Apache 2?  LLNL (and various other projects, like
Kubernetes) are starting to prefer it due to its explicit patent grant
(it is otherwise much like BSD).  It is not as compatible as
BSD/MIT, though; in particular, it's not compatible with GPLv2.  I would be
tempted to push for MIT/BSD because, as you say, I only have so many cycles
to think.

I am fine with Apache. I still think MIT/BSD is better, but Apache is
a big improvement over LGPL.

Ok, I'll work on this.  I guess the next question is: is that all you need?  I'd be interested to hear from other hashdist folks.

I only speak for myself here.

No; the license was actually a minor, though important, point, and not just from me. For example, even at the ECP meeting this exact issue was raised by some Sandia people toward the MFEM (https://github.com/mfem/mfem) project. It's relatively minor, since you can be a very successful open source project even under the GPL, e.g. git.

The more important points are the technical and social ones. When I tried Spack a few years ago, I had several concerns, many of which might have been addressed since, but I would need to use Spack for a while to tell. The main issue in this thread, of course, is binary packages, and Spack does not actually do that yet. I think Hashdist + Conda can deliver; it is unclear to me whether Spack can deliver the same way Conda does.

You said it's not that hard, but I am not sure. You have to distinguish conda-build (https://github.com/conda/conda-build), which does the actual relocatable build (and perhaps that part is not that hard, though it's not trivial either), from the main conda (https://github.com/conda/conda), which is all about binary distribution. You have to provide a polished experience, some place to upload packages (your own, or e.g. https://anaconda.org/certik/dashboard; maintaining such a website is already a lot of work), and it needs to run across Mac, Linux, Windows, and HPC, and then it needs to manage environments, etc. Conda does all that. You can browse some of their PRs to get an idea of the kinds of issues that need to be taken care of:

https://github.com/conda/conda/pulls

My understanding is that it's mostly good user experience. I view Conda as something like the "modules" system that Spack depends on, but much better.
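For concreteness, the declarative side of that user experience is conda's `environment.yml` file, which pins packages and channels in one place and resolves everything from binary packages. This is a hypothetical example; the environment name and version pins are illustrative:

```yaml
# hypothetical environment spec; versions are illustrative
name: proteus-dev
channels:
  - conda-forge
  - defaults
dependencies:
  - python=2.7
  - numpy
  - mpi4py
  - hdf5=1.8.*
```

Running `conda env create -f environment.yml` builds the whole environment from prebuilt binaries, which is the polished experience being discussed above.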

Ondrej

Jimmy Tang
Mar 12, 2017, 3:39:08 PM
to hashdist, d.s.se...@astro.uio.no, cek...@gmail.com
Hi All,

On Wednesday, 1 March 2017 19:48:56 UTC, Todd Gamblin wrote:
Hi all,

I wanted to chime in with a few points, and see if I can be convincing.  I'm the creator/lead developer of Spack (https://github.com/LLNL/spack), and we've had some prior discussions on this list comparing hashdist to Spack.  I know some of you IRL, but I've never met Dag or Chris, so hello!

Spack *has* an open source community, and it’s focused on HPC.  We have managed to convince most of the large HPC centers in DOE to get on board, and many are contributing.  We are also getting contributions from many HPC centers in academia and outside the US (CERN, Fermi, NASA-GISS, EPFL, BSC, etc.)  See slide 23 here for how contributions have grown, and from where:


and here for a measure of the level of contributions to the project:


Chris wrote:

What drove the development of hashdist was the need for customized, reproducible builds of scientific Python stacks, with many non-Python dependencies and specifically vendor/host dependencies. In particular, supporting those builds in a platform-independent way across machines in a _slightly_ challenging HPC environment. That need still exists and hasn’t been addressed well by other tools that I know of.

Spack had similar motivations, perhaps without the focus on Python.  I would say we wanted to build large codes in a __very__ challenging (GPU/xeon phi/cray/power8/etc.) HPC environment.

Spack also already provides *many* of the features on Chris's feature list, as well as many HPC-oriented features that neither hashdist nor Conda has. These have helped us grow adoption:

1. compiler provenance
2. compiler swapping
3. support for cross-compiled platforms (platforms can have multiple OS/target combinations)
4. generating modules (lmod & tcl)
5. using vendor packages that are only provided through modules (e.g. on Cray)
6. multiple dependency types: build/link/run
- allows, e.g., front-end build deps to be built for the host and not target on cross-compiled machines
7. virtual dependencies, build variants, and a dependency system that supports them
- swappable MPI, swappable BLAS/LAPACK/SCALAPACK versions
8. mirroring packages over an air gap
9. a query interface, a syntax, and a dependency model for all of that

At the dependency level, Spack is not so different from hashdist.  We have full control over the dependencies, like hashdist, but to some extent we have more control because we support more parameters (compilers, variants, versions, etc.), and we allow dependencies on builds with specific parameters.  Like hashdist, we use hashes to identify builds in a combinatorial build space, and you can refer to builds by hash, but you can also refer to builds and query them based on their build parameters through the UI.
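The shared idea here, in both hashdist and Spack, is that a build's identity is a hash of its canonicalized description, so any change to a version, compiler, variant, or dependency yields a different build. The sketch below is a minimal illustration of that principle only; it is not the actual hashing scheme of either tool, and the field names are made up for the example:

```python
import hashlib
import json

def spec_hash(spec):
    """Hash a canonicalized build description. Any change to any
    parameter (version, compiler, variants, dependencies) changes
    the hash, so each point in the combinatorial build space gets
    its own identity."""
    # Canonicalize: sorted keys and fixed separators so that
    # logically equal specs always serialize identically.
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:7]

# Illustrative spec; the fields are hypothetical, not Spack's schema.
base = {
    "name": "hdf5",
    "version": "1.10.0",
    "compiler": "gcc@6.3.0",
    "variants": {"mpi": True},
    "dependencies": {"openmpi": "2.0.1"},
}
# Swapping only the compiler produces a distinct build identity.
rebuilt = dict(base, compiler="intel@17.0")

assert spec_hash(base) == spec_hash(dict(base))   # stable for equal specs
assert spec_hash(base) != spec_hash(rebuilt)      # distinct per parameter set
```

The query interface on top is then a matter of mapping user-facing parameters (name, version, compiler, variants) back to these hashed builds, rather than making users type hashes directly.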


A little late to the discussion, but I think these are all valid points. Having used hashdist, and recently Spack a bit, I think hashdist serves the needs of developers/scientists, while Spack feels more like it's serving the needs of a sysadmin. I'm also not entirely sure how well hashdist could integrate with conda.
 