TL;DR The Container Service doesn't do its own job queuing or scheduling; we rely on the Docker Swarm and Kubernetes backends to do that for us.
Full Answer
First, I want to acknowledge that running containers in an HPC environment and submitting to SLURM or some similar scheduler are not handled very well by the Container Service as it currently exists. It was designed for launching and monitoring jobs in a certain way, and HPCs just have a very different set of requirements and assumptions.
That is to say, I'm sorry that the Container Service doesn't meet this need very well. You're not alone; this is a common situation that we are aware of, but right now, today, we don't have any solutions to the problems you're facing. That answer may change in the future, but right now it is what it is.
Now, about your proposed workaround. I'll repeat the details to confirm I understand. You want to launch an HPC job from XNAT, and to do that you have a thin intermediate container that translates the job details between what the Container Service provides and what your scheduler needs. That works OK for launching but not for status monitoring or canceling. So you're imagining that the container could run a persistent service that stays alive for the duration of the HPC job. You mentioned using the Event Service to kill the container once the HPC job is done, but I think you could make it fancier: the container could check for HPC job status updates and logs and proxy them back to XNAT, and it could proxy kill requests from XNAT down to kill the HPC job (though Docker Swarm or Kubernetes might not let the container keep running that long).
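To make that concrete, here is a rough sketch of what the proxy container's entrypoint could look like. This is hedged and assumes SLURM specifically; the script, paths, states, and polling interval are all placeholders, not anything the Container Service provides. The idea is: submit with sbatch, stay alive polling the job state, print status so it lands in the container logs (and therefore in XNAT), and translate a kill of the container into an scancel of the HPC job.

```python
#!/usr/bin/env python3
"""Hypothetical proxy-container entrypoint: submit a SLURM job, then stay
alive for its duration so kill/finish signals can be translated."""
import signal
import subprocess
import sys
import time

POLL_SECONDS = 60
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED", "TIMEOUT", "OUT_OF_MEMORY"}

def submit(batch_script: str) -> str:
    # --parsable makes sbatch print just the job id (optionally ";cluster")
    out = subprocess.run(["sbatch", "--parsable", batch_script],
                         check=True, capture_output=True, text=True)
    return out.stdout.strip().split(";")[0]

def state(job_id: str) -> str:
    # squeue covers pending/running jobs; fall back to sacct once the job
    # has left the queue
    out = subprocess.run(["squeue", "-h", "-j", job_id, "-o", "%T"],
                         capture_output=True, text=True)
    if out.stdout.strip():
        return out.stdout.strip()
    out = subprocess.run(["sacct", "-n", "-X", "-j", job_id, "-o", "State"],
                         capture_output=True, text=True)
    return out.stdout.strip().split()[0] if out.stdout.strip() else "UNKNOWN"

def main(batch_script: str) -> int:
    job_id = submit(batch_script)
    print(f"Submitted SLURM job {job_id}", flush=True)

    # If the backend kills this container, cancel the HPC job too
    def on_term(signum, frame):
        subprocess.run(["scancel", job_id])
        sys.exit(1)
    signal.signal(signal.SIGTERM, on_term)

    while True:
        s = state(job_id)
        print(f"SLURM job {job_id} state: {s}", flush=True)  # visible in container logs
        if s in TERMINAL_STATES:
            return 0 if s == "COMPLETED" else 1
        time.sleep(POLL_SECONDS)

if __name__ == "__main__":
    sys.exit(main(sys.argv[1]))
```

The container's exit code then reflects the HPC job's outcome, which is what the Container Service already knows how to interpret.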
Now, given that proposed workaround, you're worried that it doesn't scale. The HPC can handle that many jobs because it has a queue and a scheduler, but if you're running a bunch of little services in containers to monitor all those HPC jobs, then you need a queue and a scheduler for them too.
My answer to this is that the Container Service doesn't implement any job queuing itself; we rely on the compute backends (Docker Swarm and Kubernetes) to handle that. This is why we recommend that bare Docker not be used to run your containers in production: it has no mechanism to prevent you from overwhelming your compute resources, whereas Docker Swarm and Kubernetes do. If you're running on a bare Docker backend, turn on Docker Swarm mode. I think it would also help to set the CPU and memory requests and limits in the Command, as these are passed on to the backend when launching the job.
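For example, the resource fields in the Command might look something like the fragment below. This is a sketch from memory, so please check the Command format documentation for your Container Service version for the exact field names and units (memory values are, I believe, in MB); it's written here as a Python dict only for brevity, and the image name is made up.

```python
# Hedged sketch of the resource-related fields in a Container Service Command.
# Confirm field names and units against the Command JSON docs before relying on this.
command_fragment = {
    "name": "hpc-proxy",
    "image": "myorg/hpc-proxy:latest",  # hypothetical image
    "reserve-memory": 1024,  # request: backend only schedules where ~1 GB is free
    "limit-memory": 2048,    # hard cap enforced by Swarm/Kubernetes
    "limit-cpu": 1.0,        # CPU cap passed through to the backend
    # ... the rest of the Command (command-line, mounts, wrappers) as usual
}
```

With those set, Swarm or Kubernetes can queue the little monitor containers against real resource limits rather than launching them all at once.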
I'll end with another note about your proposed workaround. I think it's very clever! In fact, that is exactly the kind of thing we have thought about building ourselves to solve this problem. We were thinking about having a persistent service that can listen for job launch requests from XNAT and translate them into submissions to various schedulers, monitor job progress and logs and report them back to XNAT, listen for kill or restart requests, etc. So we're absolutely thinking along the same lines.
The difference is that we were picturing a single instance of this service handling all N jobs, whereas with the tools available to you, you had to map the service to the job one-to-one, which means running N services for N jobs. This isn't a criticism, by the way (I applaud your analysis of the problem); it's just that the Container Service doesn't give you the tools you'd need to build the more scalable 1:N version. The advantage of your method, though, is that you can build it yourself in a reasonable amount of time, whereas our proposed solution doesn't currently have an estimated delivery date other than "not soon".
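Just to illustrate the distinction, the 1:N version would boil down to one long-running loop tracking a map of XNAT jobs to scheduler jobs, rather than one container per job. Everything below is purely illustrative; nothing like it exists in the Container Service today, and the XNAT-facing reporting is omitted entirely.

```python
# Purely illustrative 1:N broker loop (SLURM assumed, ids made up).
import subprocess
import time

# xnat_workflow_id -> SLURM job id; a real service would populate this from
# incoming launch requests rather than a hard-coded dict
active = {"wf-001": "123456", "wf-002": "123457"}

while active:
    for workflow_id, slurm_id in list(active.items()):
        out = subprocess.run(["squeue", "-h", "-j", slurm_id, "-o", "%T"],
                             capture_output=True, text=True)
        state = out.stdout.strip()
        if not state:
            # Job has left the queue; a real broker would check sacct for the
            # final state and report it back to XNAT before forgetting the job.
            del active[workflow_id]
        else:
            print(f"workflow {workflow_id}: SLURM {slurm_id} is {state}")
    time.sleep(60)
```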