Singularity image download synchronization during multiple job startup

Mike Moore

Feb 26, 2019, 4:30:07 PM
to singularity
Hi Everyone,

  I've been trying to address some startup failures we see in our HPC environment when starting multiple jobs on the same node with Singularity. We have a run rate of ~30-50 million jobs per month, and many jobs start simultaneously in the various workflows. During my testing, I am seeing situations where, while the first job on a node is downloading the container from the registry, a second job is released to the same node. The second job tries to use the still-downloading image and, because the image is incomplete, Singularity fails and the second job dies. After the image is downloaded and stored in the cache, multiple jobs can start in parallel without failing.

So, my questions...

Is anyone dealing with a similar situation?
What have you done to address this conflict?
Is anyone using a container registry or object store that is accessed in parallel by your cluster?
Are you storing your simg/sif files in a shared file system and running them directly?
Does your job prolog check for the existence of the container locally before starting?
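On the last question, one minimal prolog-side sketch (nothing here is Grid Engine-specific; the retry budget and poll interval are made-up illustrations) is to hold the job until the image file appears, assuming something else, e.g. the first task on the node, performs the actual pull:

```shell
#!/bin/sh
# wait_for_image: hold a job in its prolog until the image file exists,
# giving up after a retry budget so a failed download doesn't hang jobs.
wait_for_image() {
    image="$1"
    max_tries="${2:-60}"             # default: 60 polls
    delay="${3:-5}"                  # default: 5 seconds between polls
    tries=0
    while [ ! -f "$image" ]; do
        tries=$((tries + 1))
        [ "$tries" -ge "$max_tries" ] && return 1
        sleep "$delay"
    done
    return 0
}
```

The prolog would call something like `wait_for_image /local/cache/default.sif || exit 100` to requeue or fail the job cleanly instead of letting Singularity crash on a partial image.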

I can see the finish line for deploying containers to our environment; this is the last big hill we need to climb...

Thanks
-Mike

Gregory M. Kurtzer

Feb 27, 2019, 10:42:34 PM
to singularity
How are you running the job in question? Is it an MPI job and are you providing the Docker URL to run it?

If so, can you cache (or rather build) the image beforehand, put it on parallel storage, and run it from there? For example:

$ singularity build /scratch/container.sif docker://container
$ srun -n X singularity exec /scratch/container.sif /path/to/containerized/mpi_program

Greg

--
You received this message because you are subscribed to the Google Groups "singularity" group.
To unsubscribe from this group and stop receiving emails from it, send an email to singularity...@lbl.gov.


--
Gregory M. Kurtzer
CEO, Sylabs Inc.

Mike Moore

Feb 28, 2019, 11:36:34 AM
to singularity
Hi Greg,

  So, our current workloads are batch jobs. There is very little MPI at all currently. We use Univa Grid Engine for our scheduler. Our users submit job arrays with 100s to 1000s of tasks per job. Each task is similar, just on different data sets. We do have multiple (up to 20) tasks that can start on a node almost simultaneously.

  We are deploying Singularity images to convert our very large and customized bare metal image into a container. This is the first step in decoupling our applications from our bare metal installation and starting a path to computing in the cloud. We went with Singularity because our shared environment makes Docker a non-starter with our security department.

  I have been trying to get Sregistry working fast enough to move the project out of a POC for functional testing and into production use. However, the time to download the image to the system is unacceptably slow.  I've patched Singularity 3.1.0 to address the missing Shub local caching support, which makes everything better if the image is locally cached.  If the image is not cached and begins to download by the first task on a node, any additional tasks that are released to the node will fail when starting singularity because the image is incomplete. I have test cases (up to 500 tasks) with only a handful of tasks successfully completing because the first tasks are downloading the image and the additional tasks die because the download is in progress.  Our image is on the order of 1.5-9.0GB depending on the software included and the function of the image. So, time to download the image is significant.

  So, the way we have essentially been running our jobs is:

qsub -q all.q -t1-500 singularity run shub://sregistry/image <application>

Once the scheduler releases the job to the compute node, it starts Singularity, which downloads the image from sregistry. We have written a wrapper that we will use to force all jobs to run in Singularity once our users have completed acceptance testing. But the issue of Singularity crashing because the cached image is incomplete is a blocker.

I was trying to get an idea of what others were doing for storing and starting their containers in their HPC environments and if they have a similar issue with trying to start multiple instances of a container on the same system at the same time if the container is not cached yet.


v

Feb 28, 2019, 12:37:43 PM
to singu...@lbl.gov
What if you used job dependencies, with a job to handle a single pull of all required containers running before the batch?



--
Vanessa Villamia Sochat
Stanford University '16

Gregory M. Kurtzer

Feb 28, 2019, 9:11:44 PM
to singularity
Hi Mike,

I'd have the same recommendation: first download and cache the container image, then execute it in parallel, which is what most others are doing in my experience.

$ singularity pull ~/container.sif shub://sregistry/image
$ qsub -q all.q -t1-500 singularity exec ~/container.sif <application>

Not sure if that is possible given your workload, but that would be the easiest and most performant way to handle this. If not, let me know, happy to think through this in other ways.

Greg


Mike Moore

Mar 1, 2019, 1:45:02 PM
to singularity
Hi Vanessa,

  There are several issues with a Job dependencies approach.  The first is that the job to download and cache the image would have to be run on all nodes. We don't explicitly assign jobs to a select group of hosts. It's a batch processing environment.  Every CPU compute node can run every CPU compute job, and the same is true for our GPU nodes. There is no guarantee that the node that runs the download job will take the tasks.

  From the admin side, we are trying to slide in containerization between the existing system image and the applications without requiring a massive change to the user workflows.  A lot of our users are researchers who just run a tool provided to them from the tool builders. They are not cluster aware application developers. We are working with our toolbuilders on this project, but their focus is on new work, not trying to rework old workflows. So we have to make this as seamless as possible to the existing job submission workflows. Introducing job dependencies will likely break these workflows.

  Our prolog wrapper is how we are going to force the workflows into containerization. We have two "default" container images (one for CPU jobs, one for GPU jobs) that are almost 100% identical to our current images. That 9 GB container image I mentioned in our collection is one of those defaults. Our toolbuilders have pushed back against the slow download times already. The efforts to enable caching, object storage backends, etc. were attempts to reduce the download times.

I had hoped that enabling shub caching in Singularity would help, and it does so long as we don't run into this corner case. But knowing our workflows, users, and operational procedures, we are going to hit it regularly. That's why I am trying to figure out any alternatives short of putting our containers into a shared directory. The shared directory model would work for today, but we have to move to the cloud, and running a shared directory that is not object-based is costly.

Mike Moore

Mar 1, 2019, 1:55:23 PM
to singularity
Hi Greg,

As I said to Vanessa, our sticking point is moving legacy workflows into containers in a non-intrusive way. Pre-pulling the image is either intrusive (it requires workflow changes from the users) or runs into the same issue (the wrapper pulls to local disk, but simultaneous task startups would still collide). Short of putting some logic into the wrapper to look for some sort of lock file for an active download, I don't see how to work around this issue other than having the containers available everywhere, either by regularly syncing images to all nodes or by using a shared file system (NFS, GPFS, Lustre, etc.).

Just for a reference point, Univa's solution for integrating Singularity and Grid Engine was to have a directory that stores the images and a load sensor that looks for the available images in that directory and reports them back to the scheduler. The user would then request the image they want as a resource, and Grid Engine would run the job on nodes that reported that image. Not super cloud-friendly...


v

Mar 1, 2019, 2:10:38 PM
to singu...@lbl.gov
HPC and cloud are very different use cases, and that seems to be the edge we are hitting here!

  There are several issues with a Job dependencies approach.  The first is that the job to download and cache the image would have to be run on all nodes. We don't explicitly assign jobs to a select group of hosts. It's a batch processing environment.  Every CPU compute node can run every CPU compute job, and the same is true for our GPU nodes. There is no guarantee that the node that runs the download job will take the tasks.

This would be a rationale for a shared filesystem, which is fairly common in HPC.
 
  From the admin side, we are trying to slide in containerization between the existing system image and the applications without requiring a massive change to the user workflows.  A lot of our users are researchers who just run a tool provided to them from the tool builders. They are not cluster aware application developers. We are working with our toolbuilders on this project, but their focus is on new work, not trying to rework old workflows. So we have to make this as seamless as possible to the existing job submission workflows. Introducing job dependencies will likely break these workflows.

Singularity was (first) optimized for this use case (HPC), so a shared cache would be a good solution. The only other alternative would be educating your users to pull to a file first, and then submit the file to the jobs. You don't need to use dependencies exactly, just the simple example that @gmk gave of pulling first. You are saying this isn't an option?
 
  Our prolog wrapper is how we are going to force the workflows into containerization. We have two "default" container images (one for CPU jobs, one for GPU jobs) that are almost 100% identical to our current images. That 9 GB container image I mentioned in our collection is one of those defaults. Our toolbuilders have pushed back against the slow download times already. The efforts to enable caching, object storage backends, etc. were attempts to reduce the download times.

A shared filesystem cache, so you could download once, would be a typical HPC use case (sharing binaries between jobs) and help this, no? I see that you opened a PR to do that here: https://github.com/sylabs/singularity/pull/2776
 
I had hoped that enabling shub caching in Singularity would help, and it does so long as we don't run into this corner case. But knowing our workflows, users, and operational procedures, we are going to hit it regularly. That's why I am trying to figure out any alternatives short of putting our containers into a shared directory. The shared directory model would work for today, but we have to move to the cloud, and running a shared directory that is not object-based is costly.

If you need to run a file, and the file is in some external server, it seems logical that you either need to download a gazillion copies of the same thing (not ideal) or share a few copies via shared systems (more ideal).  The cloud is a different use case because it won't be too costly on the instance to download, but it would be a burden on the registry. This is good rationale for having a service that can handle that kind of concurrency - whether it means deploying multiple front ends to your own storage, or using something like Google Storage that can handle it. What could be other solutions? You could set up something complicated with Globus and then have a shared folder "somewhere else" but that just makes it more complicated. You could provide a wrapper for users that enforces a pull first, but it sounds like you don't want to do that. 

v

Mar 1, 2019, 2:16:59 PM
to singu...@lbl.gov
Kind of random, but this looks pretty cool, if you want some kind of distributed storage :)

Mike Moore

Mar 1, 2019, 3:46:35 PM
to singularity


On Friday, March 1, 2019 at 2:10:38 PM UTC-5, vanessa wrote:
HPC and cloud are very different use cases, and that seems to be the edge we are hitting here!

I completely agree. We are just starting to look into this.
 
  From the admin side, we are trying to slide in containerization between the existing system image and the applications without requiring a massive change to the user workflows.  A lot of our users are researchers who just run a tool provided to them from the tool builders. They are not cluster aware application developers. We are working with our toolbuilders on this project, but their focus is on new work, not trying to rework old workflows. So we have to make this as seamless as possible to the existing job submission workflows. Introducing job dependencies will likely break these workflows.

Singularity was (first) optimized for this use case (HPC), so a shared cache would be a good solution. The only other alternative would be educating your users to pull to a file first, and then submit the file to the jobs. You don't need to use dependencies exactly, just the simple example that @gmk gave of pulling first. You are saying this isn't an option?
 
Like I said, for the legacy workloads that we are just trying to lift into a container, this all needs to be invisible to the jobs and essentially the users.  For new workflows, things can be very different.  It is a tough problem.  Add into this whole discussion that if there is any sort of deviation in the time it takes to complete a workflow, our users are on us about how "the grid is slow".... There is a lot of code/workflows created by researchers, not programmers/developers.  So we have a lot of incredibly non-optimal workflows with huge inefficiencies. And it "can't be broken". So we are trying to introduce Singularity to abstract the workflows to enable us to move forward.

  Our prolog wrapper is how we are going to force the workflows into containerization. We have two "default" container images (one for CPU jobs, one for GPU jobs) that are almost 100% identical to our current images. That 9 GB container image I mentioned in our collection is one of those defaults. Our toolbuilders have pushed back against the slow download times already. The efforts to enable caching, object storage backends, etc. were attempts to reduce the download times.

A shared filesystem cache, so you could download once, would be a typical HPC use case (sharing binaries between jobs) and help this, no? I see that you opened a PR to do that here: https://github.com/sylabs/singularity/pull/2776
 
Yes, that PR enables singularity to check the local cache for shub downloads. This feature was missing. But even with that fix, if two tasks are released to the same node nearly simultaneously and the required container is not in the cache, the first task starts the download to the cache, but the second task just sees the file name in the cache and tries to run it.  If the download was sufficiently fast, this would be less of an issue. If it is cached, it is not an issue.

Just to give some data, here are some rough timing results I did on one of our GPU servers.  I ran singularity to download one of our GPU containers (~4.5 GB in size), and run /bin/true.  This gave a pretty good idea of the time to download and run the container.

Command(s) run                                                            Time to run (s)
singularity exec shub://sregistry/centos7-cuda9.2-container /bin/true               169.7
sregistry-cli pull centos7-cuda9.2-container; \
    singularity exec ~/centos7-cuda9.2-container /bin/true                           50.1
singularity exec /gpfs/sregistry/centos7-cuda9.2-container /bin/true            1.5 - 3.5
singularity exec ~/centos7-cuda9.2-container /bin/true                                0.6
/bin/true (run on bare metal)                                                       0.001

The second example (sregistry-cli + singularity) was to come up with a rough estimate of the time it would take if we downloaded the container directly from object storage. The third example gives an idea of how fast it would be if we ran the containers from our GPFS file system; that will vary considerably based on cluster load and health. The fourth example gives an idea of what running out of the Singularity cache is like. My testing with Singularity 3.1.0 plus my PR showed a runtime of ~45s for 500 tasks going through our scheduler.

I had hoped that enabling shub caching in Singularity would help, and it does so long as we don't run into this corner case. But knowing our workflows, users, and operational procedures, we are going to hit it regularly. That's why I am trying to figure out any alternatives short of putting our containers into a shared directory. The shared directory model would work for today, but we have to move to the cloud, and running a shared directory that is not object-based is costly.

If you need to run a file, and the file is in some external server, it seems logical that you either need to download a gazillion copies of the same thing (not ideal) or share a few copies via shared systems (more ideal).  The cloud is a different use case because it won't be too costly on the instance to download, but it would be a burden on the registry. This is good rationale for having a service that can handle that kind of concurrency - whether it means deploying multiple front ends to your own storage, or using something like Google Storage that can handle it. What could be other solutions? You could set up something complicated with Globus and then have a shared folder "somewhere else" but that just makes it more complicated. You could provide a wrapper for users that enforces a pull first, but it sounds like you don't want to do that.

I don't have a problem with the wrapper doing the pull. It is the corner case where one download is running while another job on the same node starts trying to use the same image. Some of this may be our own fault, because we moved the Singularity cache out of ${HOME} and into a shared local directory. We did this because 1) the GPFS home directory on the compute nodes is very limited (only to be used to create your compute environment, and read-only on the compute/GPU nodes), and 2) a shared cache reduces the amount of local storage used for image caching. I would just have to figure out a synchronization method to hold jobs while the image is being actively downloaded. The wrapper could do that.

1. Check for a lock file; if it exists, wait until it disappears and then run the container.
2. If there is no lock file, check the local cache for the existence of the image.
  a. If the image exists in the cache, run singularity.
  b. If it is missing, touch a lock file, pull the image, remove the lock file, and run singularity.
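The steps above can be sketched with flock(1) instead of a hand-rolled lock file, which avoids stale-lock problems because the kernel releases the lock if the holding process dies. This is only an illustration, not our production wrapper; the fetch command is passed in and is assumed to take the destination path as its final argument (a tiny adapter around "singularity pull" would do that):

```shell
#!/bin/sh
# pull_once: serialize the first pull of an image on a node using flock(1).
pull_once() {
    image_file="$1"; shift
    (
        flock 9                      # first task wins; later tasks block here
        if [ ! -f "$image_file" ]; then
            # fetch to a temporary name and rename only when complete, so a
            # killed download never leaves a half-written "finished" image
            "$@" "$image_file.part" && mv "$image_file.part" "$image_file"
        fi
    ) 9>"$image_file.lock"
}
```

Every task would call something like `pull_once /local/cache/img.sif fetch_cmd` before `singularity exec`; the first caller does the download while the rest wait on the lock, and a crashed holder releases the lock automatically.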

Like I said, I was just trying to see what experience others have had with this...

So, I guess my next question would be: does Singularity itself support pulling directly from an object store using an S3 or Swift client? I know it handles docker/OCI, Singularity Library, Singularity Hub, and local file system paths. That would probably be the better fit overall instead of moving to a shared file system. The transition to a public cloud would be easier with the container store being in object storage.

-Mike

v

Mar 1, 2019, 4:35:14 PM
to singu...@lbl.gov
 
Yes, that PR enables singularity to check the local cache for shub downloads. This feature was missing. But even with that fix, if two tasks are released to the same node nearly simultaneously and the required container is not in the cache, the first task starts the download to the cache, but the second task just sees the file name in the cache and tries to run it.  If the download was sufficiently fast, this would be less of an issue. If it is cached, it is not an issue.

If the idea here is that you are downloading to the (final) file name in the cache, I don't think this is the correct way to go about it. Usually the file should be downloaded with some temporary extension, and then renamed (moved) at the end only when it's whole and complete. It is the case that the container might be downloaded by two tasks at the same time, but if the check for the file's existence is done right before the move operation, there shouldn't be any attempt by one process to use a partial download from another. At worst, a few tasks start the download, one wins, and everyone else just removes theirs with the temporary extension.

The metrics are cool, and as I would expect.
 
I don't have a problem with the wrapper doing the pull. It is the corner case where one download is currently running while another job starts trying to use the same image on the same node. Some of this may be our own fault because we moved the singularity cache out of ${HOME} and into a shared local directory. We did this because 1) the GPFS home directory on the compute nodes is very limited - Only to be used to create your compute environment and is read-only on the compute/gpu nodes.  2) By using a shared cache, we reduce the amount of local storage used for image caching. I would just have to figure out a synchronization method to hold jobs if the image is being actively downloaded. The wrapper could do that. 

I'm not sure why caching wasn't implemented in the way I've been accustomed to, with renaming after lots of checks. I believe this is called downloading atomically. I implemented a simple function for the original Singularity (when it had Python), and we would want to add to that one more check that the file doesn't exist before renaming it.
 
I don't think you'd need the lock file if you did something like that, but others can correct me. What you don't want to do is stream/download into your final container... disaster.

So, I guess my next question would be: does Singularity itself support pulling directly from an object store using an S3 or Swift client? I know it handles docker/OCI, Singularity Library, Singularity Hub, and local file system paths. That would probably be the better fit overall instead of moving to a shared file system. The transition to a public cloud would be easier with the container store being in object storage.

This would be very good to see, I agree :) I've been trying to provide this support with Singularity Registry Client, but it's just a wrapper.

v

Mar 1, 2019, 5:01:29 PM
to singu...@lbl.gov
So how can we do better? We have to think of the scope of Singularity itself: it should be optimized to run and otherwise interact with containers. Adding layers so that it knows how to interact with a wide range of storage APIs is out of scope. What is in scope is asking for more general support for pulling from a registry that conforms to OCI (the distribution specification). To that end, Singularity does support pulling from a general web address, so if you are just trying to download blobs, that should work (e.g., try this:


That's just a blob. So if you have something in storage that doesn't conform to the distribution spec and also doesn't have an https address, you have to write some custom wrapper to retrieve the object (akin to the sregistry client; this is what I was trying to provide). If it does have an https address, you still (likely) need some special script to map a container URI to an https address. So could we fix everything with the distribution spec? Not yet, really. The missing piece is that OCI doesn't (yet) let you define custom content types, because a client would need to know to pull a layer, find that it's SIF, and just download it (and not decompress it to build an image). But that seems like "the right way" to go about it, to me. The static registry I threw together had a (quick) example of this; try:

curl https://singularityhub.github.io/registry-org/singularityhub/busybox/manifests/latest/

So what would be a logical path to get Singularity to support *all the places?* Something like:

 - distribution spec allows for custom media types
 - Singularity (still) conforms to OCI and learns to read the content type, and know when a container binary is being pulled (no extraction necessary)
 - Registries that conform to OCI have containers pushed with the correct content type
 - Singularity pulls them. :)

In this case, an s3:// address, or anything not https, would just be a known type. The details need to be thought out with respect to a content type (SIF) versus a storage type (s3 object vs. https on some filesystem), but these are details; the important thing is that this would all be part of the specification, and Singularity could choose to support the types that it needs. I suspect it would be a list argument in the config :)

But I think, if Singularity is going as strongly OCI as it has been, this approach would be preferred. This way, you don't need to "hard code" some custom API endpoint into Singularity. We have some faith in the standard to be adopted.

The above is related to just pulling your containers. The next issue you bring up is harder, and that's workflows. Workflows that work seamlessly between cloud and HPC are an even bigger thing. You can take on some specific workflow manager that works across HPC and cloud (e.g., Cromwell, Nextflow, maybe Snakemake), but then you are adding layers of messy config files and this huge scary thing that the workflow manager might go away and you are screwed. You've probably noticed that there is no "singularity-compose" akin to docker-compose, or any easy way to orchestrate services. I'd be surprised if Sylabs isn't working on something like this. Kubernetes is great, but it doesn't solve the HPC problem. While I don't want to state that any of those examples are a good way to go across regions and compute, I will say that we would theoretically want a tool that has mappings across places, and that also abstracts the crapton of configuration files away from the user. I guess only time will tell what that tool turns out to be.

/crawls back into dinosaur cave





David Dykstra

Mar 1, 2019, 5:09:36 PM
to singu...@lbl.gov
Mike,

This isn't a short-term answer, but the High Energy Physics answer to
the software distribution problem, including singularity images, is CVMFS:
https://cernvm.cern.ch/portal/filesystem

This is installed at many hundreds of universities and labs, but so far
has not made it onto most HPC systems. We keep pushing for it and I
think eventually it will be accepted, because it is simply such a good
technical solution. Publication of code is done at one place anywhere
in the world, and it is practically immediately available everywhere the
clients are installed. Files appear to be immediately present on all
the machines, but actually they're downloaded on demand when they're
accessed. It's a perfect architecture for HPCs because all of the
metadata operations are performed on the client nodes on downloaded
catalogs of files (typically broken up into nested catalogs referring to
roughly 100K files or fewer), and caching is done on each node and
centrally at each site (the latter in standard http caches). It
completely bypasses the local filesystems, although if the nodes have no
local disks, they can each use a single file on the shared high-speed
filesystem, mounted loopback, for the per-node cache.

The cvmfs software is open source and installable by anyone from CERN:
https://cernvm.cern.ch/portal/filesystem/downloads
The Open Science Grid also supports el6 & el7 rpms:
https://opensciencegrid.org/docs/worker-node/install-cvmfs/

The Open Science Grid has a docker gateway that automatically publishes
configured images from docker, unpacked into a CVMFS repository. CERN is
making one too.

Dave

David Dykstra

Mar 1, 2019, 5:46:00 PM
to singu...@lbl.gov
I forgot to mention that all the files are stored compressed and
deduplicated, which saves a lot of space. They are transferred
compressed over the network and uncompressed on the client. In the
client cache they are still de-duplicated, but uncompressed. All the
files are named internally by a secure hash, and are protected by a
single secure signature on each published version of software.

Dave

David Dykstra

Mar 6, 2019, 1:08:14 PM
to Mike Moore, singu...@lbl.gov
Mike,

I now realize that I missed the earlier part of this thread, and I
just went back and re-read it from the archives and see that you are
interested in both HPC and cloud workflows.

In my opinion, singularity containers as one large image file is really
only practical when you have a fast shared filesystem like on HPC
systems. HPC administrators love Singularity compared to storing
software as individual files on their filesystem, because it mounts the
large file as a loopback filesystem, which moves almost all of the
metadata operations to the worker node instead of onto their limited-
resource metadata servers. In addition, the entire image files are not
downloaded to the worker nodes as a large blob this way; only the pieces
of the container that are actually accessed are downloaded, and only on
demand as they are accessed and not all at the beginning of a job.

By contrast, as you point out, in a cloud you don't have a large fast
filesystem directly mounted on all the nodes. This is much closer to
the grid model, and this is where CVMFS really shines and what it was
really designed and optimized for. As I said earlier, CVMFS also moves
all the metadata operations to the worker nodes, and only downloads the
pieces that are actually accessed, and only on demand as they are
accessed. Fermilab has successfully run a very large workflow of CMS
(the LHC experiment) jobs with this in Amazon's AWS. Now if we can just
persuade the HPC admins to also support CVMFS, we would have one
mechanism that works well everywhere. Currently to run on HPC systems
we instead mostly need to set up a large cvmfs client system somewhere
that periodically reads large subsets of what we have installed in CVMFS
and write them out to a big container to copy to each HPC system's
shared filesystem.

Dave