How to reproducibly run multi-container workflows


mle...@gmail.com

Jul 20, 2021, 12:25:20 PM
to Singularity Community Edition
We want to replace most installed software on our clusters with Singularity images, because that means fewer conflicts for admins (like juggling multiple versions of a program needed by different working groups) and more flexibility and reproducibility for users (each user can pull the Singularity images they need). The optimal case is that users just have to replace something like "/opt/Bio/BLAST/v1.1/bin/blast" with "singularity run /apps/blast-v1.1.sif".
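To make that replacement transparent, a thin wrapper script could keep the old command name working. A minimal sketch, assuming hypothetical paths (wrappers/, /apps/blast-v1.1.sif):

```shell
# Sketch: install a wrapper named like the old tool, so "blast" keeps working.
# The paths wrappers/ and /apps/blast-v1.1.sif are hypothetical examples.
mkdir -p wrappers
cat > wrappers/blast <<'EOF'
#!/bin/sh
# Forward all arguments into the container image.
exec singularity run /apps/blast-v1.1.sif "$@"
EOF
chmod +x wrappers/blast
# Putting wrappers/ on PATH makes the switch invisible to users.
```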

Multiple commands must be chained together into a reproducible workflow. In my field (computational biology), there are multiple tools known for that, and my colleagues use everything from plain bash scripts and make to Snakemake and Galaxy. Some of these also have the option to specify a container image: Snakemake, for example, can pull containers on demand, and CWL has an option to specify a Singularity image, too.
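For illustration, a Snakemake rule pinned to a container image might look roughly like this (rule name, files and image tag are made-up examples; check the Snakemake docs for your version):

```shell
# Sketch: a Snakefile rule whose tool comes from a container image.
# The rule, file names and image tag below are hypothetical examples.
cat > Snakefile <<'EOF'
rule blast:
    input: "query.fa"
    output: "hits.tsv"
    container: "docker://ncbi/blast:2.12.0"
    shell: "blastn -query {input} -out {output} -db nt -outfmt 6"
EOF
# Running with --use-singularity would make Snakemake pull the image on demand:
# snakemake --use-singularity
```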

But if we chain together Singularity images of our programs for better reproducibility, this immediately raises the question of how to ensure that the workflow management tools themselves are available reproducibly. For example, Snakemake, Toil (implementing CWL) and doit need Python, and even for bash there are several "newer" constructs that might not be available if, say, your cluster still runs an old version on Debian oldoldstable. Not to mention all the UNIX commands that deviate from POSIX in different ways on different platforms.

The ideal solution would be to start a Singularity container within another, but that seems to be impossible for security reasons (https://groups.google.com/a/lbl.gov/g/singularity/c/NO4gz4zbuTg).

So how do you do it? I have two suboptimal options below, but I would be happy to hear about your approaches.

Best regards,
Moritz
-------------

Option 1: Provide specialized containers to each user on demand, but this has several downsides:
 - Containers get huge (e.g. R + Python + a geostatistics lib = 1 GB)
 - A lot of duplication, because each user needs a slightly different combination of programs
 - Users who require a new program in the midst of their analysis need to build a new container. This might change the versions of other programs, requiring a full rerun of their analysis, and analysis runs can take far too long for that.

Option 2: The container with the workflow tool has access to the host machine, for example via SSH to localhost. Downsides:
 - Complicated for users to understand. Some struggle even with basic R and Linux commands, and the concept of one container is already complicated enough, let alone programs in one container calling back to the host and then into yet another container.
 - Host access must be secured against malicious use
 - Admin intervention might be required to enable SSH access
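To make Option 2 concrete, the call-back pattern might look roughly like the helper below (run_on_host, HOST_EXEC and all paths are hypothetical, and the real thing needs the SSH setup and hardening mentioned above):

```shell
# Sketch of Option 2 (hypothetical helper): each tool invocation escapes to
# the host via SSH, and the host then starts the program's own container.
run_on_host() {
    # HOST_EXEC would be "ssh localhost" inside the workflow container;
    # it is left overridable so the pattern can be tried without SSH.
    ${HOST_EXEC:-ssh localhost} singularity run "/apps/$1.sif" "${@:2}"
}
# Inside the workflow container one would then write, e.g.:
# run_on_host blast-v1.1 -query query.fa -out hits.tsv
```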
