Container plugin limit jobs running in parallel


Ajay Kurani

Apr 14, 2023, 12:05:35 PM
to xnat_discussion
Hi XNAT Experts,
   I have a question as we are implementing our first set of containers. Since we are running on an HPC, each pipeline container will essentially collect all of the relevant pathways, scan types, and pipeline options, put them into a configuration file, and use that file to submit a SLURM job on our HPC. Based on this workflow, I assume the container would close immediately after it creates the SLURM job for a given pipeline and would disappear from the processing dashboard.
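To make that concrete, here is a rough sketch of the kind of thin wrapper I have in mind; the file names, arguments, and sbatch script are just placeholders, not anything we have built yet:

#!/usr/bin/env python3
"""Hypothetical thin wrapper: collect the pipeline details passed in by the
Container Service, write them to a config file, and submit a SLURM job.
The argument names, config file name, and run_pipeline.sh are placeholders."""
import argparse
import json
import subprocess

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--session-path", required=True)
    parser.add_argument("--scan-types", required=True)
    parser.add_argument("--pipeline", required=True)
    args = parser.parse_args()

    # Put everything the pipeline needs into one configuration file.
    config = {
        "session_path": args.session_path,
        "scan_types": args.scan_types.split(","),
        "pipeline": args.pipeline,
    }
    with open("job_config.json", "w") as f:
        json.dump(config, f, indent=2)

    # Submit the job and exit; the container finishes as soon as sbatch returns.
    result = subprocess.run(
        ["sbatch", "--parsable", "run_pipeline.sh", "job_config.json"],
        capture_output=True, text=True, check=True,
    )
    print(f"Submitted SLURM job {result.stdout.strip()}")

if __name__ == "__main__":
    main()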

The issue becomes that if the user wants to cancel the job or check its status, there would be nothing to cancel or check, since the container finishes as soon as it creates the SLURM submission. One workaround is to keep the container active and use the event service to close it once it detects that the job is complete on our supercomputer. However, if people are launching containers at a project level with hundreds to thousands of jobs (think of launching on ADNI or some other large-scale AI project), it is not feasible to keep all containers open at the same time, unless there is some queuing mechanism.

We wanted to find out if there is a configuration setting that allows one to control how many containers are active at the same time while the rest go into a queue. If there isn't, how are large numbers of containers currently handled, given that only a finite amount of resources is available at any given time? Any tips or suggestions would be helpful.

Thanks,
Ajay

John Flavin

Apr 14, 2023, 1:45:02 PM
to xnat_di...@googlegroups.com
TL;DR We rely on the Docker Swarm and Kubernetes backends to do this queuing and scheduling for us.

Full Answer
First, I want to acknowledge that running containers in an HPC environment and submitting to SLURM or some similar scheduler are not handled very well by the Container Service as it currently exists. It was designed for launching and monitoring jobs in a certain way, and HPCs just have a very different set of requirements and assumptions. 

That is to say, I'm sorry that the Container Service doesn't meet this need very well. You're not alone; this is a common situation that we are aware of, but right now, today, we don't have any solutions to the problems you're facing. That answer may change in the future, but right now it is what it is.

Now, about your proposed workaround. I'll repeat the details to confirm I understand. You want to launch an HPC job from XNAT, and to do that you have a thin intermediate container that translates the job details between what the Container Service provides and what your scheduler needs. That works ok for launching but not for status monitoring or canceling. So you’re imagining that the container could run a persistent service that stays alive for the duration of the HPC job. You mentioned using the event service to kill the container once the HPC job is done, but I think you could make it fancier: the container could check for HPC job status updates / logs and proxy them back to XNAT, and it could proxy kill requests from XNAT down to kill the HPC job (though docker/swarm/k8s might not let your job continue running that long).
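To sketch what that could look like, purely as an illustration (the job-id handling, polling interval, and the sacct/scancel calls would all depend on how your wrapper talks to the cluster):

#!/usr/bin/env python3
"""Hypothetical per-job monitor: stays alive for the duration of one SLURM job,
relays its state to stdout (captured as the container log), and translates a
kill signal from the container backend into scancel on the HPC job."""
import signal
import subprocess
import sys
import time

JOB_ID = sys.argv[1]  # SLURM job id, e.g. from sbatch --parsable
TERMINAL = {"COMPLETED", "FAILED", "CANCELLED", "TIMEOUT", "OUT_OF_MEMORY"}

def cancel(signum, frame):
    # If XNAT (via docker/swarm/k8s) stops this container, kill the HPC job too.
    subprocess.run(["scancel", JOB_ID])
    sys.exit(1)

signal.signal(signal.SIGTERM, cancel)
signal.signal(signal.SIGINT, cancel)

while True:
    out = subprocess.run(
        ["sacct", "-j", JOB_ID, "-X", "--noheader", "--parsable2", "--format=State"],
        capture_output=True, text=True,
    ).stdout.strip()
    state = out.splitlines()[0] if out else "PENDING"
    print(f"SLURM job {JOB_ID}: {state}", flush=True)
    if any(state.startswith(t) for t in TERMINAL):
        sys.exit(0 if state.startswith("COMPLETED") else 1)
    time.sleep(60)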

Now given that proposed solution, you're thinking that it doesn't scale. The HPC can handle that many jobs because it has a queue and a scheduler, but if you're running a bunch of little services in containers to monitor all those HPC jobs then you need a queue and a scheduler for them too.

My answer to this is that the Container Service doesn't implement any job queuing itself. We rely on the compute backends (Docker Swarm and Kubernetes) to handle that. This is why we recommend that bare Docker not be used to run your containers in production; it doesn't have any kind of mechanism to prevent you from overwhelming your compute resources, whereas Docker Swarm and Kubernetes do. If you’re running on a bare Docker backend then turn on Docker Swarm mode. I think it would also help to set the cpu and memory requests and limits in the Command, as these are passed on to the backend when launching the job.
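For example, a fragment of a Command definition with those settings might look roughly like this (I'm writing the field names and units from memory, so check them against the Command JSON format docs for your Container Service version before relying on them):

{
  "name": "my-pipeline",
  "image": "myregistry/my-pipeline:1.0",
  "command-line": "run_pipeline.sh [CONFIG]",
  "reserve-memory": 4096,
  "limit-memory": 8192,
  "limit-cpu": 2.0
}

With those values in place, the backend scheduler has something to work with when deciding how many jobs fit on a node at once.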

I'll end with another note about your proposed workaround. I think it's very clever! In fact, that is exactly the kind of thing we have thought about building ourselves to solve this problem. We were thinking about having a persistent service that can listen for job launch requests from XNAT and translate them into submissions to various schedulers, monitor job progress and logs and report them back to XNAT, listen for kill or restart requests, etc. So we're absolutely thinking along the same lines.

The difference is that we were picturing building it so we could run one single instance of this service for all N jobs, but with the tools you had available you had to map the service and the job one-to-one, which means running N services for N jobs. This isn't a criticism, by the way—I applaud your analysis of the problem—it's just that the Container Service doesn't provide you with the tools you'd need to build the more scalable 1:N version. The advantage of your method, though, is that you can build it yourself in a reasonable amount of time, whereas our proposed solution doesn't currently have an estimated time for delivery other than "not soon".

John Flavin
Backend Team Lead
He/Him




Ajay Kurani

May 30, 2023, 1:04:29 AM
to xnat_discussion
Hi John,
   Thank you for the detailed explanation. To give you a bit more detail, since I think you had it mostly correct:

1) The HPC runs SLURM and the containers must be Singularity (not Docker, due to root permission issues).
2) On a separate VM we are running a private Docker registry to host the "pipelines", which are light Docker containers that take the job info for a given pipeline (pathway, selected images, preconfigured pipeline options such as cores/RAM, etc.), translate it into a SLURM command, and send that along with a job configuration file to the selected compute resource (1 of 2 clusters). The Singularity images then run with the JSON config file to execute the actual job.
3) As you pointed out, right now our solution looks like an N:N solution, since there may not be a 1:N solution at this time for job monitoring. We would potentially run N Docker containers on the VM, one for each job launched. Based on your comments, it seems we should use Docker Swarm on this VM, either giving it enough resources to run as many containers as jobs launched, or letting Swarm run only as many jobs as the VM's resources allow.
---I have not used Docker Swarm up to this point, so I may have some of the details incorrect.
---One potential workaround for limited resources on the VM, which would also help with scaling, could be a single container for project-level launches (we still need to see whether running Docker within Docker is feasible and what the performance impact would be).
4) I appreciate the "fancier" option of job killing/monitoring as it was along the lines of some of our discussions.

We will likely work on this option and do some testing of the VM's limitations to see how practical it is with the current infrastructure, or whether we need to run this on a new physical server with more resources. Thanks for all of the suggestions!

Best,
Ajay

akluiber

Nov 16, 2023, 3:57:51 PM
to xnat_discussion
Hi all,

Bringing this issue back up. Does anyone have an example of how to schedule/limit concurrent jobs for a container to a specific number in a Docker swarm? My XNAT server is the swarm manager, and I have a GPU node being used to perform some segmentation at the session level. If many new sessions come in, however, the GPU's memory can get overwhelmed when the event service sends the many jobs to that node simultaneously. I understand the scheduling is handled by the Docker Swarm backend, but the documentation I can find on limiting the maximum number of concurrent jobs per node (i.e. --max-concurrent, --replicas, --replicas-max-per-node) is in the context of a docker service, whose parameters I don't think I can control beyond placement constraints in the backend configuration.
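To show what I mean, those flags would look like this on a service you create by hand (the image and label names here are made up, and since the Container Service creates its own per-job services, I don't see where I could pass them myself):

docker service create \
  --name gpu-segmentation \
  --constraint 'node.labels.gpu == true' \
  --replicas-max-per-node 1 \
  myregistry/segmentation:latest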

Any suggestions?
