Thesis proposal within the scope of CBL.


luigi viscardi

Sep 6, 2013, 10:41:48 AM
to cloudb...@googlegroups.com
I am Luigi Viscardi, and I am preparing my master's thesis in Informatics at the Università degli Studi di Milano. Below is my thesis proposal within the scope of CBL. I would like to ask for your suggestions, to help me identify the requirements and the most useful targets for the community.

Targets
The focus of the thesis is on Versioning and Reproducibility.
  • Versioning
CloudBioLinux (CBL) is essentially a Linux distribution derived from Ubuntu on which various dedicated/specialized packages are installed. Some (but not all) of these packages are available in the Ubuntu/Debian .deb format.
Having software available in other forms (not packaged) brings problems such as conflicting requirements and the need to manage updates and dependencies for part of the software by hand.
The ideal solution to the versioning problem would be to have packages for all the programs (in the Ubuntu/Debian .deb format). This is clearly impossible, especially in a scientific environment where there is a strong need to use experimental software that is still in development.
Through the implementation (or consolidation) of a library of tools/programs, it should be possible:
    •   to generate reports of the installed software versions and their update status
    •   to guarantee a robust, reliable, and consistent installation process
  • Reproducibility
Starting from a running virtual machine (updated relative to the one initially installed), it should be possible:
  •   to obtain an installable image with the same packages (copying the whole virtual machine is not an option, to avoid sharing users/data)
  •   to compare that image with another one, highlighting the differences (and possibly identifying the steps needed to make them identical)
The library will have to provide tools/programs that allow the analysis of a running system and the creation of a configuration able to replicate the installed packages.
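
The image-comparison step can be sketched in a few lines of Python. This is a minimal illustration with hypothetical manifests (package name → version), not CBL's actual code:

```python
def diff_manifests(image_a, image_b):
    """Compare two package manifests (dicts of name -> version).

    Returns packages only in A, packages only in B, and version
    mismatches, which together describe the steps needed to make
    the two images identical.
    """
    only_a = sorted(set(image_a) - set(image_b))
    only_b = sorted(set(image_b) - set(image_a))
    changed = sorted(
        (name, image_a[name], image_b[name])
        for name in set(image_a) & set(image_b)
        if image_a[name] != image_b[name]
    )
    return only_a, only_b, changed

# Hypothetical example manifests from two images:
a = {"bwa": "0.6.2", "samtools": "0.1.18"}
b = {"bwa": "0.7.4", "bowtie": "1.0.0", "samtools": "0.1.18"}
only_a, only_b, changed = diff_manifests(a, b)
# only_a -> [], only_b -> ["bowtie"], changed -> [("bwa", "0.6.2", "0.7.4")]
```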

The rationalization of the Versioning and Reproducibility aspects will be supported by restructuring the installation process of the basic (minimal) system. The target is to give the user certainty about the outcome of the activity (including a clear report of failure or inability to proceed), in particular:
  • pre-installation phase: the initial phase of system creation, during which the following checks are carried out:
    • check that the basic tools needed for the installation process are available
    • check the environment where the deployment will take place, identifying problems that could stop the installation process
  • guided installation: a guided installation mode, suitable for end users (not developers or systems analysts)
  • logging/reporting: an adequate level of detail about the status of the deployment process and its result, so that possible issues can be clearly identified
  • testing phase: it could be useful to provide an environment for testing the specialized/dedicated software, so that a validation process can be run after every update
Strategies
Many strategies could be adopted:
  1. extensive reuse of the existing tools (dpkg/apt), whose reliability during package installation and upgrade, even in the presence of errors, guarantees a well-defined outcome:
    • using apt/dpkg as much as possible, ideally by providing packaged versions of most of the interesting software, allows a simple installation process and straightforward versioning
  2. improving the Fabric infrastructure already in use:
    • at the moment Fabric is used only for installing packages
    • the extension could cover version management
  3. using an external tool, such as Nixpkgs:
    • it is independent of any legacy/distribution-specific management tool
    • it has many interesting features for versioning and reproducibility:
      • non-destructive upgrades
      • undo/rollback of operations
      • simultaneous availability of different versions of a package
  4. combining inputs from multiple package managers into an overall picture of the system:
    • in the CBL environment there are two groups of software:
      • standalone programs (in .deb format or not packaged)
      • libraries for development tools (Python, R, Perl, etc.), which have their own native management tools
    • both kinds of tools (apt/dpkg and the native managers) allow a simple installation process and versioning, so it makes sense to aim for a .deb package for every program in the first group, while for the second group the best solution is to use the native tools
The thesis will study and analyze these possible solutions, trying to identify not just the best solution in the abstract, but the one most useful for the CBL community.
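
The combined picture described in strategy 4 can be sketched as follows (Python; the inventories and the manager/package names are illustrative only):

```python
def merge_inventories(**sources):
    """Merge per-manager inventories (name -> version) into one
    overall picture of the system, keyed by (manager, package) so
    the same name reported by different managers (e.g. an apt
    package vs. a pip library) does not collide."""
    overall = {}
    for manager, packages in sources.items():
        for name, version in packages.items():
            overall[(manager, name)] = version
    return overall

# Hypothetical inventories from apt/dpkg and two native managers:
overall = merge_inventories(
    apt={"bwa": "0.7.4-1"},
    pip={"numpy": "1.7.1"},
    cran={"ggplot2": "0.9.3"},
)
# overall[("pip", "numpy")] -> "1.7.1"
```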

Thanks in advance for any suggestions,

Luigi

Brad Chapman

Sep 12, 2013, 3:06:13 PM
to luigi viscardi, cloudb...@googlegroups.com

Luigi;
Thanks for this writeup. Nice work. We've already talked offlist about
this so I know this incorporates most of the previous discussion. I'd
love to hear what other people in the community think, but from my
perspective helping improve versioning would be a great help.

Practically, here is the script that currently accumulates all of the
package, library, and custom versions on a machine:

https://github.com/chapmanb/cloudbiolinux/blob/master/utils/cbl_installed_software.py

So happy to have you build off of that or take whatever direction you
feel is best. Looking forward to it,
Brad



Roman Valls Guimera

Sep 17, 2013, 3:00:05 PM
to cloudb...@googlegroups.com
Hi Luigi,

You could also try to further automate FPM packaging within CBL:

https://github.com/jordansissel/fpm

Imagine a scenario where:

The .tar.gz downloading/installation logic is "intercepted" by FPM,
trying to build a package at the end of the installation process.

If a package is built successfully via FPM, submit it to a
"staging package site", similar to launchpad.net but "not ready for
mass consumption".

Cool outcomes from that project would be:

1) Reducing the custom logic for package management within CBL.
2) Publishing "draft" packages that could be further polished manually,
potentially leading to an approved upstream package in the major
distributions.
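
The hand-off could be sketched roughly as below (Python; `build_fpm_args` is a hypothetical helper, not part of CBL, and the flags follow fpm's documented `dir`-to-`deb` usage):

```python
def build_fpm_args(name, version, staged_dir, prefix="/usr/local"):
    """Build an fpm invocation that turns a staged install directory
    into a draft .deb package (fpm's dir -> deb mode)."""
    return [
        "fpm",
        "-s", "dir",         # source type: a plain directory tree
        "-t", "deb",         # target type: Debian package
        "-n", name,          # package name
        "-v", version,       # package version
        "--prefix", prefix,  # where the tree lands on install
        "-C", staged_dir,    # package the contents of this directory
        ".",
    ]

args = build_fpm_args("bwa", "0.7.4", "/tmp/bwa-stage")
```

The intercepted installer would then run it with `subprocess.check_call(args)` and, on success, upload the resulting .deb to the staging site.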

Just my 2 cents ;)
Roman


Pjotr Prins

Sep 18, 2013, 1:55:54 AM
to cloudb...@googlegroups.com
Hi Roman,

I appreciate the work you are doing with fpm. It is cool to generate
packages for two target distribution systems.

I think, however, that for the general case it is much better to take
a distribution-agnostic packaging system that properly solves atomic
changes and roll-backs and fixes dependency hell. If you haven't already,
spend a few hours reading up on Nixpkgs. You'll be surprised: you'll be
able to support any number of memcached versions concurrently, each with
its own versioned dependencies(!)

I came from exactly where you are: I once wrote a meta-packager. It is
people like us who can appreciate a well-designed system. Compared to
Nixpkgs, Debian and RPM are rather broken, even if they work well
enough for existing distributions (I run Debian with Nixpkgs on
top, and I can deploy the same packages on CentOS).

BTW CBL has Nix support.

Pj.

luigi viscardi

Sep 20, 2013, 8:26:41 AM
to cloudb...@googlegroups.com
Roman,

your work is very interesting.
The ability to transform non-packaged software (or software in a different format) into a specific package format is a good idea, especially for unified management.
I'm trying to understand how to make installation and package management as simple and useful as possible.
I'm evaluating different solutions, and I will certainly also try yours.

Thanks,
Luigi

Tim Booth

Sep 26, 2013, 7:50:01 AM
to cloudb...@googlegroups.com
Hi Luigi,

Sorry if I've misunderstood some of the points below but I hope I can
contribute to this conversation.

Regarding easier building of packages:

As someone who has made a lot of DEB-format packages, I know there is
room for improvement in scripts like dh_make and jh_make that try to
auto-roll draft packages for you but they do work, in that you can
indeed get from a bare tarball to a .deb file in just a few commands.

Things that take time are:

1. Resolving dependencies, especially if they are not already
packaged. jh_make tries to do this automatically, while dh_make
expects you to work them out yourself. If there is a
well-behaved ./configure script you can probably infer some deps
automatically, but in the general case you need to do detective
work and read the docs.
2. Describing the package with a summary so people can see what it
is in the package manager. This is less important for libraries
which are installed automatically as dependencies of something
else, but in general it has to be done manually.
3. Checking license terms. A chore of a job, but I don't want to
get sued or get my users in trouble for distributing things I'm
not allowed to, and many "free" bits of academic software
actually have restrictive or discriminatory license terms.
4. Adding a watch file to help you check for future updates.
5. Deciding what goes in the actual package and where it should
live. For example if the tarball contains 50MB of examples, or
documentation in 7 formats, or convenience copies of libraries
you already have on the system, or supporting scripts that only
work if your database is installed
in /opt/Sanger/data/~hugo53/test etc.
6. Other random per-package cruft

Most things that you download don't just compile into a single binary so
there are always decisions to be made and fixes to be added. It's easy
to look at packaging systems like DPKG and RPM and say they are crufty
and burdened with legacy features and things like NIX are demonstrably
better and neater, but the truth is that most of the time-consuming
complexity comes from the individual packages not the packaging system
itself.

Regarding auditing the software on a system so as to reproduce a given
analysis:

I remember this being a hot topic at
http://www.open-bio.org/wiki/Codefest_2012 back in LA, but I can't find
details of who was working on it or what the outputs were (some
prototype script that captured machine state?). Is there anyone who was
at the Codefest but not on this mailing list who might have thoughts on
the topic? (I was doing unrelated Galaxy stuff myself, but of course
the Galaxy Tool Shed is designed to audit what versions of tools are
being used in analyses and allow them to be repeated.)

Cheers,

TIM
--
Tim Booth <tbo...@ceh.ac.uk>
NERC Environmental Bioinformatics Centre

Centre for Ecology and Hydrology
Maclean Bldg, Benson Lane
Crowmarsh Gifford
Wallingford, England
OX10 8BB

http://nebc.nerc.ac.uk
+44 1491 69 2705

Steffen Möller

Sep 26, 2013, 8:39:53 AM
to cloudb...@googlegroups.com
Hello,

All the bits and pieces that contribute to the regular
packages of Linux distributions each have their own little
reason why they were once introduced. There is good and
bad packaging, but it always comes down to allowing
automated rebuilding from source and a clear
separation between what comes from the developers and what was
changed by the maintainer.

Tim is right that if you have all the build dependencies
in the distribution, and the package just builds with "make",
there is not much left to do. Larger Ruby or Java suites cause
problems because of their typically inflexible versioning
and the breadth of their dependencies. Otherwise, I am not sure
one gains that much.

Best,

Steffen

> Sent: Thursday, 26 September 2013 at 13:50
> From: "Tim Booth" <tbo...@ceh.ac.uk>
> To: cloudb...@googlegroups.com
> Subject: Re: [cloudbiolinux] Thesis proposal to play in CBL scope.

Brad Chapman

Sep 27, 2013, 11:39:57 AM
to Tim Booth, cloudb...@googlegroups.com

Tim;

> Most things that you download don't just compile into a single binary so
> there are always decisions to be made and fixes to be added. It's easy
> to look at packaging systems like DPKG and RPM and say they are crufty
> and burdened with legacy features and things like NIX are demonstrably
> better and neater, but the truth is that most of the time-consuming
> complexity comes from the individual packages not the packaging system
> itself.

From my perspective, the downside of deb/rpm has nothing to do with
complexity but with being distribution-specific. If I'm trying to build
an installer for a tool with many third-party dependencies, I end up
needing to support at least Ubuntu/Debian, CentOS/RedHat, and Mac OS X.
If I end up going the packaging route, this means writing a deb, an RPM,
and a Homebrew recipe, and then maintaining those and getting them
pushed to upstream repositories on every version update.
I think the work y'all do on distribution-specific packaging is super
valuable to make rock-solid distributions. The temptation of other
tools is that they'd provide a way to write potentially complicated
build instructions once and make them able to run on lots of platforms.

> Regarding auditing the software on a system so as to reproduce a given
> analysis:
>
> I remember this being a hot topic at
> http://www.open-bio.org/wiki/Codefest_2012 back in LA, but I can't find
> details of who was working on it or what the outputs were (some
> prototype script that captured machine state?).

We ended up with this script that will query system packages, library
specific installs and custom scripts:

https://github.com/chapmanb/cloudbiolinux/blob/master/utils/cbl_installed_software.py

and generates YAML files with details on installed system versions:

https://github.com/chapmanb/cloudbiolinux/tree/master/manifest

This still needs work to be more generalized and automated, but it is a
proof of concept of the type of thing we'd love to have more formalized.
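
For the system-package part, the core idea is small enough to sketch. The parsing below is a simplified, hypothetical version of what the linked script does for dpkg (the real script covers library installs and custom tools as well):

```python
import subprocess

def dpkg_manifest(query_output=None):
    """Return a dict of installed package -> version.

    If query_output is None, ask dpkg-query directly; otherwise parse
    the given text (handy for testing without dpkg installed).
    """
    if query_output is None:
        query_output = subprocess.check_output(
            ["dpkg-query", "-W", "-f", "${Package}\t${Version}\n"]
        ).decode()
    manifest = {}
    for line in query_output.splitlines():
        if line.strip():
            name, version = line.split("\t", 1)
            manifest[name] = version
    return manifest

sample = "bwa\t0.7.4-1\nsamtools\t0.1.18-1\n"
# dpkg_manifest(sample) -> {"bwa": "0.7.4-1", "samtools": "0.1.18-1"}
```

The resulting dict can then be dumped to YAML to produce a manifest like the ones in the repository.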

Brad

luigi viscardi

Oct 1, 2013, 5:15:30 AM
to cloudb...@googlegroups.com, luigi viscardi
Brad, 

I did some installation tests to understand how CBL works. I installed the CBL environment using three different releases of Ubuntu:
  - 11.04 (natty)
  - 12.04 (precise)
  - 13.04 (raring)

The attachment (univ_tesi_tests-installs.pdf) documents these tests.

Based on these tests, I would like to ask some questions:
  1. which versions must be supported?
  2. what happens when packages listed in the *.yaml config files are not available (or are not installable)?
    1. bio-linux-dotter
    2. bio-linux-estscan
    3. bio-linux-catchall
    4. ...
  3. in the custom.yaml file there are some programs that are also available as packages (installable via apt-get):
    1. emboss
    2. bwa
    3. bowtie
    4. bowtie2
    5. gmap
    6. lastz
    7. ...
    • which should have priority?
  4. besides knowing which packages are installed (and their versions), might it be useful to verify the versions against an official repository (such as https://github.com/chapmanb/cloudbiolinux/tree/master/manifest)?
    • this way it would be possible to validate/certify that the packages are consistent with the *biolinux environment;
    • at the moment many package versions differ across the different releases of Ubuntu
  5. in some Python scripts in the CBL environment (for example deb.py, distribution.py, etc.) there is no real separation between functional logic and data; as a result, every time the data (often configuration data) changes, you need to edit the Python scripts in many places, with the risk of making mistakes and/or inserting data in the wrong position or format:
    • would it not be more useful and safer to use external configuration files (as with the *.yaml files)?
I hope I was clear enough,

Luigi
univ_tesi_tests-installs.pdf

Pjotr Prins

Oct 2, 2013, 5:53:32 AM
to cloudb...@googlegroups.com
Hi Luigi,

In a nutshell, you are saying that we need a tractable system that
allows rigorous versioning of software and dependencies. I fully
agree. All the problems mentioned originate in the packaging system(s). I
think CBL is a mess now, even if it is a usable mess.

Building a product on multiple packaging systems, targeting multiple
systems, and adding a build system on top is a recipe for disaster.
Ironically, we are dealing in software here. One of the great things
about software is that you *can* make it rigorous and reproducible,
whilst retaining flexibility.

I don't think splitting the git tree is the way forward. That would
complicate things too much for CBL maintainers. And no more YAML
files, please. Certainly not for configuration. YAML is too limited for
that and not that easy to read.

What is needed is a versioning system that supports multiple versions
of software in multiple distributions (Ubuntu, Debian, CentOS) and
multiple versions thereof. Without reinventing the wheel.

I think we have an opportunity here to get it right. Don't go for the
compromise.

Pj.

Brad Chapman

Oct 3, 2013, 10:09:30 PM
to luigi viscardi, cloudb...@googlegroups.com, Pjotr Prins

Luigi;
Thanks for testing out CloudBioLinux and walking through the different
installations.

> 1. what are the versions that must be supported?

12.04 and 13.04 are the two targets right now. Generally the goal is to
track the latest releases with support for the latest LTS.

> 2. what happens when packages listed in the *.yaml config files are not
> available (or are not installable)?

We might end up taking them out if they're not widely installable. There
is really no hard and fast rule.

> 3. in the custom.yaml file there are some programs that are also
> available as packages (installable via apt-get):
> - which should have priority?

The custom builds happen after the system package builds, and are skipped
if the package is pre-installed and up to date with the custom version.
The custom builds are often useful for other systems where packages do
not exist, or for doing non-root installations into local directories.
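
That skip decision could look roughly like this sketch (a hypothetical helper, not CBL's actual code; it assumes simple dotted numeric versions, which real Debian version strings often are not):

```python
def needs_custom_install(system_version, custom_version):
    """Decide whether the custom build should run: only when no system
    package is installed, or the system version is older than the
    version the custom build would provide."""
    if system_version is None:  # not installed via apt at all
        return True
    as_tuple = lambda v: tuple(int(x) for x in v.split("."))
    return as_tuple(system_version) < as_tuple(custom_version)

# needs_custom_install(None, "2.0.3")    -> True  (nothing installed)
# needs_custom_install("2.1.0", "2.0.3") -> False (system is newer)
```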

> 4. besides to know which packages are installed (and related
> versions), may be useful to verify the version with an official repository
> (such as https://github.com/chapmanb/cloudbiolinux/tree/master/manifest)?

The general idea is to update the manifest with every new released
AWS version. It's out of date now as I haven't had a chance to roll a
new official Amazon AMI. A lot of focus recently has gone into local
builds but we definitely need a new AMI soon.

> 5. in some python scripts in the CBL environment (for example deb.py,
> distribution.py, etc) , does not exist a real distinction between
> functional logic and data; this causes that everytime you must change data
> (often data configuration) you need edit the python scripts in many places,
> with the risk of making mistakes and/or to insert data in wrong position or
> in wrong format:
> - would not be longer useful and safe, use external configuration files
> (as for the *.yaml files)?

Happy to talk about this if you have some practical examples of what
you'd like to move into the configuration. The general idea was that the
YAML is basically a set of packages so you can see what is getting
installed, and the rest lives in the code.

Pjotr:
> In a nutshell, you are saying that we need a tractable system that
> allows rigorous versioning of software and dependencies. I fully
> agree. All the problems mentioned originate in the packaging system(s). I
> think CBL is a mess now, even if it is a usable mess.
>
> Building a product on multiple packaging systems, targeting multiple
> systems, and adding a build system on top is a recipe for disaster.

We can agree to disagree here. CBL is totally practical and designed to
fill a gap that is not well solved with existing systems: providing
installation of lots of disparate custom biological software across
multiple architectures. It's definitely not aiming to be a build system
itself and I'll happily remove functionality and merge into other build
systems as they become available.

Practically one direction I'm looking at now is more tightly integrating
with Homebrew/Linuxbrew. This provides a lot of the benefits you both
suggest, and also has a strong set of scientific packages.

Thanks much for all the great discussion,
Brad

Tim Booth

Oct 4, 2013, 5:12:57 AM
to cloudb...@googlegroups.com, luigi viscardi, Pjotr Prins
Hi,

I'd totally agree with Pjotr if this approach was being used to make
non-cloud systems. If you are setting up a server or workstation and
maintaining it over the lifetime of the hardware then the mess that
accumulates as you try to update and reconfigure the system will
eventually bring it to its knees (and rob the poor sysop of sanity).
This equally applies to regular VMs, which are often run for years.

But in the cloud, you often think of a compute node as "disposable". If
I want the latest CBL I start a new one from scratch, I don't try to
upgrade an old one. In this case, one has a lot more leeway when adding
software because the need to update systems in-place can largely be
ignored, and therefore CBL can prioritise useful features over uniform
Debian-policy-style correctness. The CBL project is still going to be
more tractable if the number of build systems and custom installers can
be kept down, but I don't think the current approach will lead to
disaster. A fair amount of reinventing the wheel, quite probably, but
not disaster :-)

TIM

> > Building a product on multiple packaging systems, targeting multiple
> > systems and adding a build system on top is a recipe for disaster.
>
> We can agree to disagree here. CBL is totally practical and designed to
> fill a gap that is not well solved with existing systems: providing
> installation of lots of disparate custom biological software across
> multiple architectures. It's definitely not aiming to be a build system
> itself and I'll happily remove functionality and merge into other build
> systems as they become available.

Pjotr Prins

unread,
Oct 4, 2013, 5:25:36 AM10/4/13
to cloudb...@googlegroups.com
If CBL is only targeting the cloud - maybe. But I think the audience can
be much wider if we get this right. That would mean more popular
uptake, and contributions in packaging and maintenance. I think that
by limiting ourselves in scope, we limit CBL's future.

Pj.

John Chilton

unread,
Oct 4, 2013, 7:07:37 AM10/4/13
to cloudb...@googlegroups.com
CBL is many things to many people. I agree with Tim Booth to a point -
namely, certain aspects of CBL shouldn't be used in conjunction with
long-running systems - installing Python dependencies globally with
pip, installing custom software directly into /usr/bin, etc.
Everything installed into /usr/bin should be coming from the package
manager. But just because some aspects of CloudBioLinux are unsuitable
for traditional systems doesn't mean they all are, or that it
shouldn't target things beyond cloud images.

I developed installation instructions for Galaxy-P (a customized
version of the Galaxy project for proteomics) using CloudBioLinux in
such a way that I think it is completely usable on long running
systems.

http://getgalaxyp.org/install.html

Namely, I break the installation into two pieces: installing all of the
needed OS packages (CBL does a good job here configuring
apt/yum/etc. and differentiating which package is which on the
different systems) and then another step where custom software is
compiled or downloaded and installed. In this second case, though, it is
all installed into its own directory (e.g. /opt/galaxy/tools). The
Galaxy install version of these CBL tools supports multiple versions,
sets up Galaxy package files appropriately, etc. Getting the latest
updates is as easy as rerunning the install procedure; it will just
place the new updates in new version directories, so there is a high
degree of reproducibility over time on the same system.
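As a concrete, purely illustrative sketch of that layout - the tool name and demo directory are assumptions, not the actual Galaxy-P scripts - each rerun creates a new version directory and repoints a "default" link:

```shell
#!/bin/sh
# Demo of versioned, isolated installs outside /usr (in real use the
# root would be something like /opt/galaxy/tools; here it is a local
# demo directory so the sketch is runnable anywhere).
set -e
TOOLDIR="$PWD/galaxy-tools-demo"

install_tool() {
    name="$1"; version="$2"
    dest="$TOOLDIR/$name/$version"
    mkdir -p "$dest/bin"
    # ...compile or download the tool into "$dest" here...
    ln -sfn "$dest" "$TOOLDIR/$name/default"   # repoint to newest
}

install_tool samtools 0.1.18
install_tool samtools 0.1.19   # rerun: 0.1.18 stays untouched
ls "$TOOLDIR/samtools"
```

Old versions stay in place, so a pipeline pinned to a specific version directory keeps working after an upgrade.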

Many computational institutions support long-running module archives
using tools such as Modules (http://modules.sourceforge.net/); this
can be thought of in the same way (in fact it would probably be pretty
easy to add Modules support to this mechanism). I think from a system
administrator perspective - the important thing is they are in
isolated directories outside of /usr.
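The modulefile machinery reduces, in essence, to PATH manipulation over those isolated directories; a rough shell equivalent of what a "module load" would do (demo paths only, not a real Tcl modulefile):

```shell
# Minimal stand-in for "module load": select one versioned install by
# prepending its isolated bin directory to PATH. Paths are demo-only.
PREFIX="$PWD/module-demo/samtools/0.1.19"
mkdir -p "$PREFIX/bin"
PATH="$PREFIX/bin:$PATH"; export PATH
# Switching versions is just a different prefix; nothing in /usr changes.
echo "$PATH" | cut -d: -f1
```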

-John

Brad Chapman

unread,
Oct 5, 2013, 12:19:09 PM10/5/13
to John Chilton, cloudb...@googlegroups.com

Tim, Pjotr, Luigi and John;
John is 100% right with the current direction of CloudBioLinux. It
broadly does three things right now:

- Install system packages on multiple systems, leveraging
Bio-Linux on Ubuntu/Debian systems. This is by default into the system
(/usr) PATH.
- Install programming language specific libraries. This defaults into
wherever is best for specific languages.
- Install custom packages. This can be in any directory and does
not need to interfere with the system installations.

By default the cloud and VM based installs put everything into one
directory since the idea is to use the same automated process to
create another one later and don't need to maintain it long term.

For non-cloud program-specific installs I do exactly what John
describes, install all of the custom things into an isolated directory
you can add to the PATH, either manually or via modules. I use this with
our bcbio-nextgen installer:

https://github.com/chapmanb/bcbio-nextgen

and it nicely enables updates in place of the system, including new
third party tool versions.

Directory or module-based "isolation" is not perfect, since you can get
issues with LD_LIBRARY_PATH if you inject multiple modules, but is a
reasonable current approximation of local VMs without any additional
software installation. Longer term, I'm optimistic that lightweight
containers like Docker will replace this approach so you use
CloudBioLinux to create a Docker image and then ship that.
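A hypothetical sketch of that flow - the image tag is made up, and this is not an official CBL workflow - would be to provision inside a Docker build rather than a VM:

```shell
# Hypothetical sketch only: bake a CloudBioLinux-provisioned Docker
# image and ship that instead of a VM. Tag and flavor are assumptions.
cat > Dockerfile <<'EOF'
FROM ubuntu:12.04
RUN apt-get update && apt-get install -y git python-pip
RUN pip install fabric pyyaml && \
    git clone https://github.com/chapmanb/cloudbiolinux.git
# Provision inside the build container with the CBL fabfile, e.g.:
# RUN cd cloudbiolinux && fab -H localhost install_biolinux
EOF
# With a Docker daemon available you would then build and distribute it:
# docker build -t mylab/cloudbiolinux .
# docker push mylab/cloudbiolinux
```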

In terms of versioning for the custom non-system installs, I've done
some work the past couple of days to enable tighter integration with
Linuxbrew/Homebrew (https://github.com/homebrew/linuxbrew). This
provides nice versioning capabilities by using Git for the underlying
recipes. With today's CloudBioLinux you can do:

fab -H localhost install_brew:samtools

to get the latest samtools. Or to get a specific older version:

fab -H localhost install_brew:samtools,0.1.18

Homebrew has a nice scientific community creating recipes
(https://github.com/Homebrew/homebrew-science) and the infrastructure is
well set up for maintaining specific versions of software.
This is all brand new but I'm hoping to migrate some of the `custom`
tool builds over to Homebrew so they can take advantage of the
infrastructure and community.

Brad

Pjotr Prins

unread,
Oct 6, 2013, 3:41:09 AM10/6/13
to cloudb...@googlegroups.com, John Chilton
On Sat, Oct 05, 2013 at 12:19:09PM -0400, Brad Chapman wrote:
> For non-cloud program-specific installs I do exactly what John
> describes, install all of the custom things into an isolated directory
> you can add to the PATH, either manually or via modules. I use this with
> our bcbio-nextgen installer:

nixpkgs does exactly that. The point I am consistently making is that
we are reinventing the wheel. And badly, at that, when it comes to
make scripting.

> Homebrew has a nice scientific community creating recipes
> (https://github.com/Homebrew/homebrew-science) and the infrastructure is
> well set up for maintaining specific versions of software.
> This is all brand new but I'm hoping to migrate some of the `custom`
> tool builds over to Homebrew so they can take advantage of the
> infrastructure and community.

Have a read of this two-year-old thread:

https://www.ruby-forum.com/topic/2104402

I am not saying homebrew is a bad idea - these Ruby things tend to
gain momentum, and I am a Ruby guy - but still they haven't gotten all
things right. What you trade is convenience for correctness.

It just hurts me that no one here even spends a single day seriously
trying nixpkgs - a system which I have been deploying for the last 5
years with gratifying results. But I'll shut up now about what I think
is the best way forward. I have made my point clearly and eloquently -

Me, I'll mix in homebrew for those packages that exist, that are not in
Debian, and do work - for the rest I'll keep using nix myself for
multi versioning, sane upgrades and cross-system deployment :). There
is nothing left to say but really try it for yourself. Oh, did I say
that nix has binary installs?

Good luck.

Pj.

Tim Booth

unread,
Oct 7, 2013, 6:57:28 AM10/7/13
to cloudb...@googlegroups.com
Obligatory XKCD:

http://xkcd.com/927/

TIM

On Sun, 2013-10-06 at 08:41 +0100, Pjotr Prins wrote:
> On Sat, Oct 05, 2013 at 12:19:09PM -0400, Brad Chapman wrote:
> > For non-cloud program-specific installs I do exactly what John
> > describes, install all of the custom things into an isolated directory
> > you can add to the PATH, either manually or via modules. I use this with
> > our bcbio-nextgen installer:
>
> nixpkgs does exactly that. The point I am consistently making is that
> we are reinventing the wheel. And badly, at that, when it comes to
> make scripting.

Brad Chapman

unread,
Oct 7, 2013, 9:59:50 AM10/7/13
to Pjotr Prins, cloudb...@googlegroups.com, John Chilton

Pjotr;
Thanks for all the helpful discussion.

> I am not saying homebrew is a bad idea - these Ruby things tend to
> gain momentum, and I am a Ruby guy - but still they haven't gotten all
> things right. What you trade is convenience for correctness.

What features do you feel like Homebrew is missing that Nix provides?

> It just hurts me that no one here even spends a single day seriously
> trying nixpkgs - a system which I have been deploying for the last 5
> years with gratifying results.

I spent quite a bit of time with Nix based on your recommendation. The
major issues I had were:

- It was difficult for me to set it up for a single-user, non-root account.
What I'd like to be able to do is have CloudBioLinux install nix in a
non-privileged directory and do all the setup/configuration inside of
there. I ran into issues where nix wanted to put things into other
directories by default and required root. There might be workarounds
here, but I generally found the installation heavy-weight.

- There doesn't appear to be a community building nix packages for
biology. plink was recently added but beyond that there are no other
current tools:

https://github.com/NixOS/nixpkgs/tree/master/pkgs/applications/science/biology

- It wasn't clear how to setup an experimental/CloudBioLinux-specific
channel for pushing packages. One of the nice things about the
current `custom` framework is that we can push fixes live immediately
for rapidly changing development work. It would be great to have a
non-stable channel able to do this.

I could have spent more time on the install and channel issues but the
lack of existing packages or a nix biology community didn't give me a
push to do so. Am I missing repositories of existing packages we could
benefit from?

Thanks again for all the thoughts,
Brad

John Chilton

unread,
Oct 7, 2013, 12:49:36 PM10/7/13
to Pjotr Prins, Brad Chapman, cloudb...@googlegroups.com
On Mon, Oct 7, 2013 at 11:12 AM, Pjotr Prins <pjotr...@gmail.com> wrote:
> Hi Brad,
>
> On Mon, Oct 07, 2013 at 09:59:50AM -0400, Brad Chapman wrote:
>> > I am not saying homebrew is a bad idea - these Ruby things tend to
>> > gain momentum, and I am a Ruby guy - but still they haven't gotten all
>> > things right. What you trade is convenience for correctness.
>>
>> What features do you feel like Homebrew is missing that Nix provides?
>
> Two things, really. Multi-system binary support with guaranteed
> correctness, and transactional installs.
>
> It boils down to this: you can trust a Nixpkg to be what it presents -
> with ALL its dependencies. In a non-correct system, there is no way
> you can guarantee that a package has not been improperly compiled,
> that libraries have not been overwritten during upgrades (say a CBL
> compile after a Debian base install). Nix does away with all those
> worries. Better even, you can bundle software with its dependencies
> and deploy it on another system, and you are guaranteed it is the same
> software running. This is what we want in Science (reproducibility),
> this is what we require in medicine (certified diagnostic tools).
> People just don't realise what this really means - to have a correct
> system.
>
> Transactional installs means that the software will complain if the
> install was not complete. Think SQL transactions. Nix does that.
>
>> > It just hurts me that no one here even spends a single day seriously
>> > trying nixpkgs - a system which I have been deploying for the last 5
>> > years with gratifying results.
>>
>> I spent quite a bit of time with Nix based on your recommendation. The
>> major issues I had were:
>>
>> - It was difficult for me to set it up for a single-user, non-root account.
>> What I'd like to be able to do is have CloudBioLinux install nix in a
>> non-privileged directory and do all the setup/configuration inside of
>> there. I ran into issues where nix wanted to put things into other
>> directories by default and required root. There might be workarounds
>> here, but I generally found the installation heavy-weight.
>
> That is different than using Nix for CBL, right? You have root
> privilege on a VM. Meanwhile, I actually use Nixpkgs on PBS clusters,
> where there is no root privilege. It works really well. You can copy
> binary files with dependencies across nodes, as long as the HOME dir
> has the same name (a different HOME dir every time does away with
> correctness testing). For me userland Nix is very useful. I can
> compile on Debian (my desktop) and run on CentOS (our PBS).
>
> For CBL with root you can just install in the default /nix/store. No
> HOME dir in sight.
>
>> - There doesn't appear to be a community building nix packages for
>> biology. plink was recently added but beyond that there are no other
>> current tools:
>>
>> https://github.com/NixOS/nixpkgs/tree/master/pkgs/applications/science/biology
>
> That dir is mine. And, no, I am the only bioinformatician that I am
> aware of who is deploying Nix. I stopped sharing packages on github some
> years back. If CBL gets serious about Nixpkgs I will start
> contributing again.
>
> Writing packages in Nix is actually gratifying. Nix has great
> isolation support during builds - i.e. you figure out all dependencies
> (guaranteed) and once a package installs and runs, it will do so for
> years to come. That is very attractive for bioinformatics. Running
> several versions of software next to each other is attractive too. So
> you can run some old version, with its dependencies, and the latest
> version, with its dependencies. And they are both guaranteed to work.
>
> I LOVE THAT. I want my systems to be predictable. That is one reason I
> use Linux over Windows. That is one reason I use Nix over Homebrew
> when I create a new deployment protocol. Especially the Galaxy crowd
> should take note of that.
>
>> - It wasn't clear how to setup an experimental/CloudBioLinux-specific
>> channel for pushing packages. One of the nice things about the
>> current `custom` framework is that we can push fixes live immediately
>> for rapidly changing development work. It would be great to have a
>> non-stable channel able to do this.
>>
>> I could have spent more time on the install and channel issues but the
>> lack of existing packages or a nix biology community didn't give me a
>> push to do so. Am I missing repositories of existing packages we could
>> benefit from?
>
> Setting up your own channel can be done - and with enough packages -
> we can even get a channel on the central system of Nix with automated
> build testing provided.
>
> In fact, Debian, homebrew and Nix packages can co-exist happily. The
> one policy we should have, is to get rid of the hard-wired build
> scripts in CBL. They don't scale, for sure, and it is the road of
> least correctness (that software just overwrites stuff - at least a
> well-behaved Debian package won't). When a homebrew package works -
> great. When we want multiple versions and/or correctness, use Nix.

I look forward to investigating ways to integrate Brew and Nix into the
Galaxy Tool Shed ecosystem, but support for the custom install stuff
has been implemented and will be integrated soon, and it is a step
forward relative to the state of things in Galaxy. I hope support for
these is not dropped.

It seems there is a continuum from easy to correct. The custom installs
represent one end of that, brew somewhere in the middle, and nix the
other end. Given the history and state of this industry, it is not
that surprising that the most correct is not the most popular.

There are going to be developers who want something done quickly and
can use the custom stuff, others who want to reach the widest audience
and will utilize brew, and still others who want to do things most
correctly; they should be able to utilize nix. CloudBioLinux seems to
have a history of not being very opinionated, and I don't think it
should start now. I am happy to be working in an ecosystem where
people can utilize nix support, I hope CloudBioLinux continues to be a
community where I can utilize these custom install procedures.

Thanks,
-John

>
> So, referring to Tim's cartoon, this is not about standards. This is
> about policies and useful tools with an eye on the future. I am
> talking from experience, it is not that I want to lead you into a dark
> future of hard work! In fact, Nix has saved my ass a few times :) All
> I can do is recommend it.
>
> Pj.

Pjotr Prins

unread,
Oct 7, 2013, 2:03:06 PM10/7/13
to John Chilton, cloudb...@googlegroups.com
On Mon, Oct 07, 2013 at 11:49:36AM -0500, John Chilton wrote:
> other end. Given the history and state of this industry, it is not
> that surprising that most correct is not the most popular.

Aye. I tend to say in bioinformatics we get away with murder.

One thing my boss is teaching me is that it is worthwhile pushing for
the best solution even if the state of affairs is discouraging. There
is no progress without visionaries pushing.

No worries about build scripts disappearing. I know Brad won't do
without those :). If my work requires me to deploy software again, I
will consider combining that with CBL. I just hope Luigi will take a
really hard look at Nix. And next time you work on the tool shed -
make sure you get proper versioning and transactions included. For
Galaxy it makes even more sense than for CBL because you deploy
long-running servers.

Pj.

Pjotr Prins

unread,
Oct 7, 2013, 12:12:16 PM10/7/13
to Brad Chapman, Pjotr Prins, cloudb...@googlegroups.com, John Chilton
Hi Brad,

On Mon, Oct 07, 2013 at 09:59:50AM -0400, Brad Chapman wrote:
> > I am not saying homebrew is a bad idea - these Ruby things tend to
> > gain momentum, and I am a Ruby guy - but still they haven't gotten all
> > things right. What you trade is convenience for correctness.
>
> What features do you feel like Homebrew is missing that Nix provides?

Two things, really. Multi-system binary support with guaranteed
correctness, and transactional installs.

It boils down to this: you can trust a Nixpkg to be what it presents -
with ALL its dependencies. In a non-correct system, there is no way
you can guarantee that a package has not been improperly compiled,
that libraries have not been overwritten during upgrades (say a CBL
compile after a Debian base install). Nix does away with all those
worries. Better even, you can bundle software with its dependencies
and deploy it on another system, and you are guaranteed it is the same
software running. This is what we want in Science (reproducibility),
this is what we require in medicine (certified diagnostic tools).
People just don't realise what this really means - to have a correct
system.

Transactional installs means that the software will complain if the
install was not complete. Think SQL transactions. Nix does that.
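Stripped of Nix itself, the mechanism is an atomic swap between immutable generations; a toy illustration of the idea (demo paths, not real nix commands):

```shell
# Toy illustration of Nix-style transactional installs: each install
# builds a complete new generation in its own directory, and the
# user-visible "profile" is just a symlink that is repointed only
# after the build fully succeeds. Paths here are demo-only.
set -e
STORE="$PWD/demo-store"
mkdir -p "$STORE/gen-1/bin"
echo 'echo samtools 0.1.18' > "$STORE/gen-1/bin/samtools"
ln -sfn "$STORE/gen-1" "$STORE/profile"   # generation 1 goes live

mkdir -p "$STORE/gen-2/bin"
echo 'echo samtools 0.1.19' > "$STORE/gen-2/bin/samtools"
ln -sfn "$STORE/gen-2" "$STORE/profile"   # upgrade = repoint the link

ln -sfn "$STORE/gen-1" "$STORE/profile"   # rollback = repoint it back
readlink "$STORE/profile"
```

A half-finished build never touches the live profile, which is the "transactional" property.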

> > It just hurts me that no one here even spends a single day seriously
> > trying nixpkgs - a system which I have been deploying for the last 5
> > years with gratifying results.
>
> I spent quite a bit of time with Nix based on your recommendation. The
> major issues I had were:
>
> - It was difficult for me to set it up for a single-user, non-root account.
> What I'd like to be able to do is have CloudBioLinux install nix in a
> non-privileged directory and do all the setup/configuration inside of
> there. I ran into issues where nix wanted to put things into other
> directories by default and required root. There might be workarounds
> here, but I generally found the installation heavy-weight.

That is different than using Nix for CBL, right? You have root
privilege on a VM. Meanwhile, I actually use Nixpkgs on PBS clusters,
where there is no root privilege. It works really well. You can copy
binary files with dependencies across nodes, as long as the HOME dir
has the same name (a different HOME dir every time does away with
correctness testing). For me userland Nix is very useful. I can
compile on Debian (my desktop) and run on CentOS (our PBS).

For CBL with root you can just install in the default /nix/store. No
HOME dir in sight.

> - There doesn't appear to be a community building nix packages for
> biology. plink was recently added but beyond that there are no other
> current tools:
>
> https://github.com/NixOS/nixpkgs/tree/master/pkgs/applications/science/biology

That dir is mine. And, no, I am the only bioinformatician that I am
aware of who is deploying Nix. I stopped sharing packages on github some
years back. If CBL gets serious about Nixpkgs I will start
contributing again.

Writing packages in Nix is actually gratifying. Nix has great
isolation support during builds - i.e. you figure out all dependencies
(guaranteed) and once a package installs and runs, it will do so for
years to come. That is very attractive for bioinformatics. Running
several versions of software next to each other is attractive too. So
you can run some old version, with its dependencies, and the latest
version, with its dependencies. And they are both guaranteed to work.

I LOVE THAT. I want my systems to be predictable. That is one reason I
use Linux over Windows. That is one reason I use Nix over Homebrew
when I create a new deployment protocol. Especially the Galaxy crowd
should take note of that.

> - It wasn't clear how to setup an experimental/CloudBioLinux-specific
> channel for pushing packages. One of the nice things about the
> current `custom` framework is that we can push fixes live immediately
> for rapidly changing development work. It would be great to have a
> non-stable channel able to do this.
>
> I could have spent more time on the install and channel issues but the
> lack of existing packages or a nix biology community didn't give me a
> push to do so. Am I missing repositories of existing packages we could
> benefit from?

Setting up your own channel can be done - and with enough packages -
we can even get a channel on the central system of Nix with automated
build testing provided.

In fact, Debian, homebrew and Nix packages can co-exist happily. The
one policy we should have, is to get rid of the hard-wired build
scripts in CBL. They don't scale, for sure, and it is the road of
least correctness (that software just overwrites stuff - at least a
well-behaved Debian package won't). When a homebrew package works -
great. When we want multiple versions and/or correctness, use Nix.

luigi viscardi

unread,
Oct 9, 2013, 8:55:09 PM10/9/13
to cloudb...@googlegroups.com
Hi all,

I have tried to split up some considerations by topic, hoping to get some clarifications essential for understanding the right way forward.

1) building packages

==== on 2013-10-02 Pjotr wrote: ====

"Building a product on multiple packaging systems, targeting multiple
systems and adding a build system on top is a recipe for disaster.
Ironically, we are dealing in software here. One of the great things 
about software is that you *can* make it rigorous, reproducible, 
whilst retaining flexibility."
  • personally I do not think the situation is so dire; most critical and massive-use environments have successfully used these tools for years
  • certainly things can improve, but I think it is a positive thing to be able to have different distributions (Ubuntu, Debian, CentOS, ...) that work with a uniform installation process using native tools, especially because the deployment processes guarantee reliability and stability
2) nixpkgs vs. traditional package managers
 
==== on 2013-10-06 Pjotr wrote: ====

"It just hurts me that no one here even spends a single day seriously 
trying nixpkgs - a system which I have been deploying for the last 5 
years with gratifying results. But I'll shut up now about what I think 
is the best way forward. I have made my point clearly and eloquently -"
  • nixpkgs is certainly one possible solution to evaluate carefully, because it could resolve some of the issues we are discussing; however, I need some explanations:
    • are there any meaningful projects using this solution?
    • what is the status of the project? how many people are working on it? how big and active is the community?
==== on 2013-10-07 John wrote: ====

"It seems there is a continuum from easy to correct. The custom installs
represent one end of that, brew somewhere in the middle, and nix the 
other end. Given the history and state of this industry, it is not 
that surprising that most correct is not the most popular."
  • I agree with this consideration:
    • but those who must make choices (and investments) in the medium/long term (and whose target is not the technical element) are probably focused on solutions that are able to offer solid and stable support (I think this is true for researchers too)
    • when I say to my boss "we can use this wonderful package for our critical environment", the first thing he asks me is: "what kind of support do they provide?" (and only second: "how much does it cost?")

3) cloud and traditional environment

==== on 2013-10-04 Tim wrote: ====

"If you are setting up a server or workstation and 
maintaining it over the lifetime of the hardware then the mess that 
accumulates as you try to update and reconfigure the system will 
eventually bring it to its knees (and rob the poor sysop of sanity). 
This equally applies to regular VMs, which are often run for years." 
....
But in the cloud, you often think of a compute node as "disposable".
If I want the latest CBL I start a new one from scratch, I don't try to 
upgrade an old one.  In this case, one has a lot more leeway when adding 
software because the need to update systems in-place can largely be 
ignored, and therefore CBL can prioritise useful features over uniform 
Debian-policy-style correctness.
  • this feature is definitely the strong point of the cloud, and I think this topic is very important, but I have not fully understood what you wanted to say:
    • if it is true that in the cloud everyone can build a new system from scratch without the need for updates, it is also true that the new system/machine must be an exact clone of an existing environment, with a guarantee of stability; and this is true for both cloud and traditional environments
    • in the cloud it is most important to be able to have a new system quickly, but it is also important, primarily in a scope like CBL, that the new system provides the same correct results as the old system, possibly with new features
      • it is fundamental that the new system keeps the existing features without introducing errors not present in the outdated release
      • it is very important that you can test the new one:
        • but testing the whole thing every time is probably more complex than testing only the differences compared to a previous release
        • it is not useful to have a new system quickly without the guarantee of a reproducible result

"The CBL project is still going to be more tractable
if the number of build systems and custom installers can be kept down"
  • certainly, I think the target is to remove, or reduce as far as possible, the custom installation processes, at least in a stable environment, in order to guarantee a reliable installation process
4) FHS, packages and custom 

==== on 2013-10-04 John wrote: ====

"CBL is many things to many people. I agree with Tim Booth to a point -
namely, certain aspects of CBL shouldn't be used in conjunction with
long-running systems - installing Python dependencies globally with
pip, installing custom software directly into /usr/bin, etc.
Everything installed into /usr/bin should be coming from the package
manager. But just because some aspects of CloudBioLinux are unsuitable
for traditional systems doesn't mean they all are, or that it
shouldn't target things beyond cloud images.
...
Namely, I break the installation into two pieces: installing all of the
needed OS packages (CBL does a good job here configuring
apt/yum/etc. and differentiating which package is which on the
different systems) and then another step where custom software is
compiled or downloaded and installed. In this second case, though, it is
all installed into its own directory (e.g. /opt/galaxy/tools).
...
I think from a system administrator perspective - 
the important thing is they are in isolated directories outside of /usr."

==== on 2013-10-05 Brad wrote: ====

"By default the cloud and VM based installs put everything into one 
directory since the idea is to use the same automated process to 
create another one later and don't need to maintain it long term." 

John is 100% right with the current direction of CloudBioLinux. It 
broadly does three things right now: 
- Install system packages on multiple systems, leveraging Bio-Linux 
  on Ubuntu/Debian systems. This is by default into the system (/usr) PATH. 
- Install programming language specific libraries. This defaults into 
  wherever is best for specific languages. 
- Install custom packages. This can be in any directory and does 
  not need to interfere with the system installations.
...
For non-cloud program-specific installs I do exactly what John 
describes, install all of the custom things into an isolated directory 
you can add to the PATH, either manually or via modules. I use this with 
our bcbio-nextgen installer:"

  • but is FHS really one of the most important problems?
    • Linux (like Unix, of course) is based on the FHS, which identifies with some precision the use of system directories:
      • /usr as a default system directory
      • /usr/local as a directory for packages that are not part of the distribution
      • /opt as a directory for optional application software packages (like "custom" applications, for example)
  • unfortunately there is some confusion in the definition of "custom":
    • all CBL-specific software, if you consider CBL as a homogeneous environment, can be defined as "custom" relative to system packages
    • but not all of this software is "custom" from the installation point of view, since there is "packaged" software and "not packaged" software (and only in this sense is it "custom")
      • in fact, "packaged" software (inside packages.yaml) is usually installed in a system directory (/usr or /usr/local)
      • and only "not packaged" software (inside custom.yaml) may be installed in a different path (like /opt/CBL)
  • maybe a different classification (and splitting) of the CBL-specific software could be useful, since packages.yaml in particular contains both system packages and CBL-specific software
    • unfortunately, however, even with this splitting, installation into a different path may be more complex, due to the fact that with apt-get it is not possible to specify an installation path, unless we download the package and compile it ourselves
  • personally I think that what matters most is not so much where the packages are located (primarily in a cloud environment), but how you can identify them, so as to keep the environment you work in under control
    • it is not very useful to be able to place the packages properly if you cannot identify them exactly
    • if packages are not installed uniformly, but you are able to know their version and their status, you know exactly what you are using
      • and you are able to replicate them on a different machine
    • partitioning (meaning where the software is positioned) is an "engineering issue"; the end user (for example, the researcher who must use the tool) does not care where the package is located, but rather that the tool is exactly what he wants and that he can find it on each cloned system
5) development and stable environment

==== on 2013-10-02 Pjotr wrote: ====

"I don't think splitting the git tree is the way forward. 
That would complicate things too much for CBL maintainers."
  • actually who want to use the CBL environment does not have many information likely to be useful for pratical and profitable use of the system: 
    • which OS release can I use?
    • which packages are stable (those the community has tested and verified to be valid, and that return correct and consistent results)? 
    • what are the community's current trends and developments that you can follow? And, if you want to be more conservative, can you stick to a "stable" environment only?
  • personally, I think a reorganization could be helpful, primarily because you would know exactly whether you are using a stable (and critical) environment, where changes must be made carefully, or a development environment, where you can run tests and make changes freely

"What is needed is a versioning system that supports multiple versions 
of software in multiple distributions (Ubuntu, Debian, CentOS) and 
multiple versions thereof. Without reinventing the wheel."
  • regardless of which version management system is used, I think it is important to have a reference point against which to compare an environment (to obtain the difference between what is installed and what should be installed)
  • can it be complicated to manage? Maybe. But if the goal of CBL is to be useful, it needs to provide all the information that makes it so, at least for the releases considered stable

"And no more YAML  files, please. Certainly not for configuration. 
YAML is too limited for that and not that easy to read."
  • I agree that YAML is not a good idea, but I think the configuration information needs to be moved out of the scripts, to keep them clean and simple:
    • you could use XML
    • or you could use standard configuration files whose entries are in the form "key=value" (like Apache or syslog-ng, for example)
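A key=value format like the one suggested above needs only a few lines to parse, which keeps the deployment scripts free of configuration details (a minimal sketch; the option names in the sample are invented):

```python
def load_config(text):
    """Parse a simple key=value configuration; '#' starts a comment line."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    return config

# Invented example options
sample = """\
# CBL deployment settings
install_dir = /opt/CBL
flavor = Minimal
"""
```

Python's standard configparser module would do the same job if sections are acceptable.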
==== on 2013-10-05 Brad wrote: ====

"In terms of versioning for the custom non-system installs,
I've done some work the past couple of days to enable tighter integration
with Linuxbrew/Homebrew (https://github.com/homebrew/linuxbrew).
This provides nice versioning capabilities by using Git for the
underlying recipes. 
..."
  • perhaps it might also be appropriate, at the organizational level, to rationalize the "https://github.com/chapmanb/cloudbiolinux" repository, so as to manage development processes in an organic way
    1. creating a "stable" tree and a "development" tree
      • in this way, those using the CBL environment know exactly which packages have been tested, and which ones are still under development
    2. separating the system configuration from the CBL-specific configuration
    3. identifying exactly which releases are supported, and their status
  • in the development tree, identify the active projects and the tests in progress, so that everyone can find exactly what they need:
    1. nixpkgs
    2. homebrew/linuxbrew
    3. upgrade and fabric extensions
    4. galaxy support
    5. ...
Thank you for all this interesting and helpful discussion
Luigi

luigi viscardi

Oct 10, 2013, 8:44:31 PM10/10/13
to cloudb...@googlegroups.com
Brad,

I tried to install "linuxbrew", following the instruction provided in "https://github.com/homebrew/linuxbrew":
  1. the directory ".linuxbrew/lib" exported in .bashrc ("LD_LIBRARY_PATH=~/.linuxbrew/lib") doesn't exist
    • is it perhaps ".linuxbrew/Library"?
  2. afterwards, I created the directory "/opt/CBL" and I ran the installation command:
    • brew install /opt/CBL
    • this command fails with the following error: "Error: No available formula for .rb"
  3. I also tried to install via fabric:
      • fab -H localhost install_brew:samtools
      • but this command fails again with the following error:
    piero@ubuntu1204-tesi-linuxbrew:~/tesi/cloudbiolinux$ fab -f fabfile.py -H localhost install_brew:samtools 
    Traceback (most recent call last):
      File "/usr/lib/python2.7/dist-packages/fabric/main.py", line 594, in main
        docstring, callables, default = load_fabfile(fabfile)
      File "/usr/lib/python2.7/dist-packages/fabric/main.py", line 156, in load_fabfile
        imported = importer(os.path.splitext(fabfile)[0])
      File "/home/piero/tesi/cloudbiolinux/fabfile.py", line 33, in <module>
        from cloudbio.utils import _setup_logging, _configure_fabric_environment
      File "/home/piero/tesi/cloudbiolinux/cloudbio/utils.py", line 12, in <module>
        from cloudbio.edition import _setup_edition
      File "/home/piero/tesi/cloudbiolinux/cloudbio/edition/__init__.py", line 9, in <module>
        from cloudbio.edition.base import (Edition, Minimal, BioNode,
      File "/home/piero/tesi/cloudbiolinux/cloudbio/edition/base.py", line 7, in <module>
        from cloudbio.cloudman import _configure_cloudman
      File "/home/piero/tesi/cloudbiolinux/cloudbio/cloudman.py", line 23, in <module>
        from cloudbio.package.shared import _yaml_to_packages
      File "/home/piero/tesi/cloudbiolinux/cloudbio/package/__init__.py", line 8, in <module>
        from cloudbio.package import brew
      File "/home/piero/tesi/cloudbiolinux/cloudbio/package/brew.py", line 10, in <module>
        from fabric.api import quiet, cd
    ImportError: cannot import name quiet

    What am I doing wrong?

    Luigi

    Brad Chapman

    Oct 10, 2013, 9:07:12 PM10/10/13
    to luigi viscardi, cloudb...@googlegroups.com

    Luigi;

    > I tried to install "*linuxbrew*", following the instruction provided in "
    > https://github.com/homebrew/linuxbrew":

    You shouldn't need to manually install linuxbrew -- the CloudBioLinux
    `install_brew` command will take care of that for you.

    > - *brew install /opt/CBL*
    > - this command fails with the following error: "*Error: No available
    > formula for .rb*"

    I'm not totally sure what you're attempting here so not exactly positive
    the right advice to provide. You should follow `brew install` with the
    name of a homebrew package (ie. `brew install samtools`). But, you
    shouldn't need to know any brew internals to install via
    CloudBioLinux -- it handles everything automatically.

    > 3. I also tried to install via fabric:
    > - *fab -H localhost install_brew:samtools*
    > - but this command fails again with the following error:
    >
    > * from fabric.api import quiet, cd*
    > *ImportError: cannot import name quiet*

    You need a more recent version of fabric. `quiet` was added in fabric
    1.5. Try doing a `pip install --upgrade fabric` and you should be good
    to go.

    Hope this helps,
    Brad

    luigi viscardi

    Oct 11, 2013, 8:14:11 AM10/11/13
    to cloudb...@googlegroups.com
    >I'm not totally sure what you're attempting here so not exactly positive 
    >the right advice to provide. You should follow `brew install` with the 
    >name of a homebrew package (ie. `brew install samtools`). But, you 
    >shouldn't need to know any brew internals to install via 
    >CloudBioLinux -- it handles everything automatically. 
    >
    sorry ... tonight I must have been drunk ... I read "$WHERE_YOU_WANT" instead of "$WHATEVER_YOU_WANT" in the install command :-(
    After the discussion about the FHS and installation paths, I thought I had to specify where I wanted to install the environment

    >You need a more recent version of fabric. `quiet` was added in fabric 
    >1.5. Try doing a `pip install --upgrade fabric` and you should be good 
    >to go. 
    >
    I will try

    Thanks a lot
    Luigi

    Brad Chapman

    Oct 15, 2013, 9:45:28 AM10/15/13
    to cloudb...@googlegroups.com

    Pjotr, John and Luigi;
    Thanks again for all this great discussion. Luigi did a great job of
    summarizing, so I only wanted to comment on a few points:

    Pjotr:
    >> What features do you feel like Homebrew is missing that Nix provides?
    >
    > Multi-system binary support with guaranteed correctness
    >
    > It boils down that you can trust a Nixpkg to be what it presents -
    > with ALL its dependencies. In a non-correct system, there is no way
    > you can guarantee that a package has not been improperly compiled,
    > that libraries have not been overwritten during upgrades (say a CBL
    > compile after a Debian base install). Nix does away with all those
    > worries. Better even, you can bundle a software with its dependencies
    > and deploy it on another system, and you are guaranteed it is the same
    > software running. This is what we want in Science (reproducibility),
    > this is what we require in medicine (certified diagnostic tools).
    > People just don't realise what this really means - to have a correct
    > system.

    I'm 100% agreed that this is critical, but my view is that it will come
    from virtual machines and lightweight containers rather than
    packaging/installation systems. The installations I want to manage with
    CloudBioLinux are complex with multiple third party tools and require
    correct system level Java, R, and Python installations. Being able to
    isolate this within a specific, defined environment and re-use existing
    packages sounds more feasible than maintaining many custom recipes. I'm
    banking on Docker being able to provide this once they support a wider
    variety of kernels.

    John:
    > I look forward to investigating way to integrate Brew and Nix into the
    > Galaxy Tool Shed ecosystem, but support for the custom install stuff
    > has been implemented and will be integrated soon and it is a step
    > forward relative to the state of things in Galaxy. I hope support for
    > these is not dropped.

    Definitely not. As Pjotr said, I need these too. However, I am hopeful
    we may be able to integrate some of these over to brew to not duplicate
    effort on maintaining and keeping up to date with the latest releases,
    as well as getting the versioning for free. The install process for
    using brew in CloudBioLinux right now automates all of the
    Homebrew/Linuxbrew setup, so it's essentially equivalent to the API for
    using custom. The only added dependency from an external user standpoint
    is Ruby.

    Luigi:
    > 1. creating a "stable" tree and a "development" tree
    > - in this way, those using the CBL environment know exactly
    > which packages have been tested, and which ones are still under
    > development

    I'm not sure about the best way to manage something like this. I
    consider everything available from CloudBioLinux "stable": the goal is
    to be able to push new updates quickly and rely on the external
    packaging tools to handle versioning and stability.

    Brad

    luigi viscardi

    Dec 15, 2013, 7:06:51 PM12/15/13
    to cloudb...@googlegroups.com
    Hi all,

    I did some other tests and I wrote a bit of code.
    On the attachment you can see a draft with the explanations and a tarball if you want to try it.
    I hope it was clear enough.

    Luigi
    univ_tesi_manage_pkgs_custom.pdf
    cloudbiolinux-20131215.tar.gz

    luigi viscardi

    Jan 19, 2014, 7:15:36 PM1/19/14
    to cloudb...@googlegroups.com
    Hi Pjotr,

    I did some tests in the NIX environment, and on the attachment you can see the results and some comments.

    I hope it was clear enough.

    Luigi
    univ_tesi_tests_nix.pdf

    pjotr...@gmail.com

    Jan 20, 2014, 2:06:46 AM1/20/14
    to cloudb...@googlegroups.com
    Hi Luigi,

    Good thing you dove into nixpkgs! It is the only way to come up with a
    balanced view of software deployment.

    The problems you are facing are mostly about resolving dependencies.
    This is actually a good thing. Nix brings out such system issues,
    there are no silent dependencies which may bite later. When
    dependencies are missing they have to be added explicitly (such as
    bzlib) - it is bound to be included in the package archive. In my
    experience all these issues are easy to fix. And the JAVA JRE is used
    in many other Nix packages - it is easy to fix too.

    I suggest you post this document to the Nix mailing list for comments
    and explain clearly why you are doing this. That will give you a
    measure of community support too. I am certain you'll get a lot of
    pointers.

    From March onwards I can fix/add some packages again, if there is
    general interest. With these things we start benefitting when the
    bio-community becomes large enough. I am happy to invest in Nixpkgs
    again even if it plays a minor role in CBL. The immediate use case is
    to run special versions of software. Homebrew can do that too, but I
    don't get the fuzzy warm feeling that Homebrew gets it right.
    Remember, Nix allows you to easily control the versioning of the main
    software and *all* its dependencies. And that is what you need for
    legacy software.

    Pj.


    On Sun, Dec 15, 2013 at 04:06:51PM -0800, luigi viscardi wrote:
    > Hi all,
    >
    > I did some other tests and I wrote a bit of code.
    > On the attachment you can see a draft with the explanations and a tarball if
    > you want to try it.
    > I hope it was clear enough.
    >
    > Luigi
    >

    Roman Valls Guimera

    Mar 13, 2014, 5:45:53 AM3/13/14
    to cloudb...@googlegroups.com
    Have not tried Nix myself yet, but this post gives a pretty good introduction for the uninitiated:

    https://www.domenkozar.com/2014/03/11/why-puppet-chef-ansible-arent-good-enough-and-we-can-do-better/

    Interesting and thought provoking.

    luigi viscardi

    Jun 29, 2014, 7:14:50 PM6/29/14
    to cloudb...@googlegroups.com
    Manage versioning & reproducibility

    Abstract
    This post summarizes the points addressed by the thesis, which led to the implementation of the solution you can find in the attachment.
    In the attachment you can also find a more detailed explanation.

    The installation process 
    The deployment process consists of the following steps:
    1. download cloudbiolinux.tar.gz (attached to this discussion)
    2. tar xfvz cloudbiolinux.tar.gz && cd cloudbiolinux/
    3. ./deploycbl.sh
    Reports of the installed software
    At the end of the installation process, a report of the installed packages and libraries is generated (compared, whenever possible, against the list of packages included in the manifest):
    • packages:
      • debian-packages-installed.log: packages-base successfully installed
      • debian-packages-not_installed.log: packages-base not installed
      • debian-packages-version_ok.log: packages-base successfully installed with right version
      • debian-packages-diff_version.log: packages-base successfully installed with different version
      • custom-packages-installed.log: packages-custom successfully installed
      • custom-packages-not_installed.log: packages-custom not installed
      • custom-packages-version_ok.log: packages-custom successfully installed with right version
      • custom-packages-diff_version.log: packages-custom successfully installed with different version
    • libraries:
      • python-packages-installed.log: python libraries successfully installed
      • python-packages-not_installed.log: python libraries not installed
      • python-packages-version_ok.log: python libraries successfully installed with right version
      • python-packages-diff_version.log: python libraries successfully installed with different version
      • r-packages-installed.log:  R libraries successfully installed
      • r-packages-not_installed.log: R libraries not installed
      • r-packages-version_ok.log: R libraries successfully installed with right version
      • r-packages-diff_version.log: R libraries successfully installed with different version 
      • perl-libs-installed.log: perl libraries successfully installed 
      • perl-libs-not_installed.log: perl libraries not installed
      • ruby-libs-installed.log: ruby libraries successfully installed 
      • ruby-libs-not_installed.log: ruby libraries not installed
      • haskell-libs-installed.log: haskell libraries successfully installed 
      • haskell-libs-not_installed.log: haskell libraries not installed
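The split into *-version_ok.log, *-diff_version.log, etc. amounts to classifying each manifest entry against what is actually installed. A minimal sketch of that classification step (the package names and versions below are invented):

```python
def classify(installed, manifest):
    """Split manifest entries into the categories used by the log files.

    Both arguments map package name -> version string.
    """
    report = {"installed": [], "not_installed": [],
              "version_ok": [], "diff_version": []}
    for name, wanted in manifest.items():
        current = installed.get(name)
        if current is None:
            report["not_installed"].append(name)
            continue
        report["installed"].append(name)
        if current == wanted:
            report["version_ok"].append(name)
        else:
            report["diff_version"].append((name, current, wanted))
    return report
```

Each category can then be written to its own .log file, one family per package type.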
    Utility commands
    A set of commands was created to generalize the execution of the common operations (where implemented) on the different object types (packages and libraries):

    • cblavail: this command shows which packages and libraries are available (in the repository)
    • cblinstall: this command installs a package or library
    • cbluninstall: this command uninstalls a package or library
    • cbllist: this command lists the packages and libraries installed
    • cblfind: this command checks if a package or library is installed
    • cblinfo: this command shows detailed information about a specific package or library
    • cblfile: this command shows the list of the files that are part of a specific package or library, and their installation path
    • cblversion: this command shows the version of a specific package or library
    • cblcompare: this command compares the version of a specific package or library with the information contained in the manifest files
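One way to implement such wrappers is a small dispatcher that maps the object type to the underlying tool's command line. This is only a sketch of the idea: the mapping below is an assumption (not the actual cbl* implementation) and covers only the "version" action:

```python
# Hypothetical mapping from object type to the underlying command line
BACKENDS = {
    "deb":    {"version": "dpkg-query -W -f='${Version}' %s"},
    "python": {"version": "pip show %s"},
    "ruby":   {"version": "gem list %s"},
}

def cbl_command(action, kind, name):
    """Build the command line a cbl* wrapper would run (sketch only)."""
    return BACKENDS[kind][action] % name
```

A real cblversion would run the resulting command and normalize its output across backends, so the user sees one uniform format.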
    The replication of an existing environment
    The replication process consists of the following steps:

    1. download cloudbiolinux.tar.gz (attached to this discussion)
    2. tar xfvz cloudbiolinux.tar.gz && cd cloudbiolinux/
    3. ./replicatecbl_from_remote.sh -i <ip>
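Conceptually, the replication step reduces to reading the remote machine's package list and installing whatever differs locally. A sketch of that planning step (the "cblinstall name=version" syntax is an assumption for illustration, not the script's actual interface):

```python
def replication_plan(source, target):
    """List the install commands needed so target matches source.

    Both arguments map package name -> version string; the
    "cblinstall name=version" syntax is hypothetical.
    """
    plan = []
    for name, version in sorted(source.items()):
        if target.get(name) != version:
            plan.append("cblinstall %s=%s" % (name, version))
    return plan
```

The source map would come from the remote host (reached via the -i &lt;ip&gt; option), the target map from the local system.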

    Viscardi
    -------------
    cloudbiolinux.tar.gz
    univ_tesi_manage_versioning_reproducibility.pdf