Unable to scale container service + batch launch beyond several scans


Ahmed Hosny

Apr 16, 2024, 12:33:46 PM
to xnat_discussion
Hello - 

I am trying to run dcm2niix on some scans using container service + batch launch. I am able to successfully process ~10 scans in a few seconds without issues. For anything larger than that, it takes forever and does not complete any of them. The app becomes very slow and unusable until I restart tomcat. Could this be related to the queueing of containers and some sort of bottleneck?

I am using XNAT 1.8.9, CS 3.4.1, and batch launch 0.6.0. I tried running with both docker and docker swarm. I do not think it is OOMing, as XNAT is running with 20GB RAM (set in tomcat.service) and another 12GB is free on the device. I also tried adjusting the JMS queue numbers, but that did not have an effect.

Any insight would be much appreciated!

Ahmed

John Flavin

Apr 16, 2024, 1:13:18 PM
to xnat_di...@googlegroups.com
> The app becomes very slow and unusable until I restart tomcat. Could this be related to the queueing of containers and some sort of bottleneck?

That's certainly a possibility. We made some improvements in the Container Service version you have, 3.4.1, aimed at improving concurrent handling of lots of container launches. But in my testing, "lots" was on the order of thousands of simultaneous container launches, and we were able to handle that. If you’re hitting problems at around 10, that tells me the CS internals should be able to handle what you’re throwing at it, and the problem must be elsewhere. But where, exactly, is another question. I'm not sure what the answer is, but I can point you to a couple of things to check.

Is there anything you can see in the JavaMelody monitoring console (at http://{your XNAT}/monitoring)? You can check memory and CPU, of course, but I would also look for threads that might be blocked.
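If you save a thread dump (e.g. from running `jstack` against the Tomcat process, or by copying the threads view out of JavaMelody), a quick tally of thread states makes any blocked threads easy to spot. This is a generic helper, not part of XNAT; the file path is illustrative:

```python
# Tally java.lang.Thread.State occurrences in a JVM thread dump so that
# BLOCKED threads stand out from the usual WAITING/TIMED_WAITING noise.
import re
from collections import Counter

def thread_states(dump_text: str) -> Counter:
    """Count how many threads are in each java.lang.Thread.State."""
    states = re.findall(r"java\.lang\.Thread\.State: (\w+)", dump_text)
    return Counter(states)

if __name__ == "__main__":
    with open("threaddump.txt") as f:  # illustrative path to a saved dump
        counts = thread_states(f.read())
    for state, n in counts.most_common():
        print(f"{state}: {n}")
```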

Is your compute backend (docker, swarm, kubernetes) throttling the number of containers that can run simultaneously? For instance, if you are running on swarm, do you have a fixed number of nodes and each node can only take so many jobs or something along those lines? If you cut the container service out entirely and wrote a script to directly tell your compute backend to launch 50 instances of a container that does some nonsense like "sleep 60; echo hello world", is there a limit to how many of those can run simultaneously?
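That backend smoke test could be scripted along these lines (the container names and the alpine image are illustrative; on swarm you would use `docker service create` instead of `docker run`):

```python
# Launch N trivial detached containers straight at the Docker daemon,
# bypassing the Container Service, then watch `docker ps` to see how
# many actually run at the same time.
import subprocess

def launch_commands(n: int) -> list[list[str]]:
    """Build N detached `docker run` commands for a do-nothing container."""
    return [
        ["docker", "run", "-d", "--rm", "--name", f"cs-test-{i}",
         "alpine", "sh", "-c", "sleep 60; echo hello world"]
        for i in range(n)
    ]

if __name__ == "__main__":
    for cmd in launch_commands(50):
        subprocess.run(cmd, check=True)
    # Then check e.g. `docker ps | wc -l` to count simultaneous containers.
```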

Hopefully you can find something that looks suspicious, and we can investigate further in that direction.

John Flavin


To view this discussion on the web visit https://groups.google.com/d/msgid/xnat_discussion/5718f888-a1c9-40ed-bf39-9c4f05a378e5n%40googlegroups.com.

Ahmed Hosny

Apr 17, 2024, 12:16:16 AM
to xnat_discussion
Thanks so much John for the prompt reply here.

I checked /monitoring and could not find any signs of OOMing or any blocked threads (they are either TIMED_WAITING or WAITING). 

I did, however, notice that the queued containers (on the worker nodes) do successfully complete and shut down. For these scans, the UI shows "Processing finished. Uploading files." It remains stuck there for some time (longer than usual), after which some succeed while others error out with the following:

2024-04-17 03:44:12,868 [DefaultMessageListenerContainer-6] ERROR org.nrg.containers.services.impl.ContainerFinalizeServiceImpl - Container 638: Could not upload files to resource: Unable to acquire thread and process locks
2024-04-17 03:44:12,879 [DefaultMessageListenerContainer-6] ERROR org.nrg.containers.services.impl.ContainerFinalizeServiceImpl - Cannot upload files for command output nifti:nifti-resource

It is still puzzling to me why the same scan would otherwise run successfully when fewer than ~10 scans are batched together. Could this be a disk I/O bottleneck? For context, this is running on AWS, and the XNAT home is an EFS (NFS) share mounted onto an EC2 instance. I've not had previous issues with a bad mount or anything, and the disk I/O metrics for EFS did not appear off.

What kind of knobs does XNAT expose to control container scheduling? I looked at the JMS queues and the docker swarm constraints, but was hoping to control, e.g., the number of workers per node as a short-term solution. If not, maybe I can invoke this container via the CS API and pass a few scans at a time?
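If calling the CS API in small batches turns out to be the way to go, a sketch might look like the following. The endpoint path, payload shape, wrapper ID, and scan URIs are all assumptions to illustrate the idea (check your CS version's API docs); the batching helper is the generic part:

```python
# Drive a Container Service bulk launch a few scans at a time instead of
# submitting one large batch. Endpoint and payload are hypothetical.
import json
import urllib.request
from itertools import islice

def batched(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk

def launch_batch(base_url, token, wrapper_id, scan_uris):
    # Hypothetical bulk-launch route; adjust for your CS version.
    url = f"{base_url}/xapi/wrappers/{wrapper_id}/bulklaunch"
    body = json.dumps([{"scan": uri} for uri in scan_uris]).encode()
    req = urllib.request.Request(
        url, data=body, method="POST",
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req) as resp:
        return resp.status

if __name__ == "__main__":
    scans = [f"/experiments/EXP{i}/scans/{i}" for i in range(100)]  # placeholders
    for group in batched(scans, 5):  # e.g. 5 scans per launch request
        launch_batch("https://xnat.example.org", "TOKEN", 42, group)
```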

I am going to test the backend, but it seems that the backend is obeying the XNAT scheduler and not spinning up all containers at once. The fact that a good amount of memory is free supports this.

Thanks!

John Flavin

Apr 18, 2024, 6:28:25 PM
to xnat_di...@googlegroups.com
What you describe there is an error coming from within XNAT core. When the Container Service has files to upload to a resource, it hands them off to XNAT core, which manages creating the resource and archiving the files into it. At the beginning of that process it creates the ThreadAndFileLock you see in the error message. For a given resource path (which is basically a directory), the ThreadAndFileLock acquires an in-memory lock to prevent other parts of XNAT from writing to the resource, and it writes a lock file into the file system. If either part of this fails, it keeps retrying for two minutes before ultimately giving up and producing an error. (I think this hard-coded two-minute retry explains why these containers take so long to finish finalizing.)
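An illustrative sketch of that acquire-with-deadline pattern (not XNAT's actual implementation; names, the one-second backoff, and the `.lock` filename are my inventions) could look like:

```python
# Acquire both a per-path in-memory lock and an on-disk lock file,
# retrying until a deadline (~two minutes in XNAT's case) before
# giving up -- the failure mode behind "Unable to acquire thread and
# process locks".
import os
import threading
import time

_memory_locks: dict[str, threading.Lock] = {}
_registry_guard = threading.Lock()

def acquire_resource_lock(resource_dir: str, timeout: float = 120.0) -> bool:
    """Try to hold both locks for `resource_dir`; False if the deadline passes."""
    with _registry_guard:
        mem_lock = _memory_locks.setdefault(resource_dir, threading.Lock())
    deadline = time.monotonic() + timeout
    lock_file = os.path.join(resource_dir, ".lock")
    while time.monotonic() < deadline:
        if mem_lock.acquire(blocking=False):
            try:
                # O_EXCL makes creation fail if another holder's file exists.
                fd = os.open(lock_file, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
                os.close(fd)
                return True
            except FileExistsError:
                mem_lock.release()  # file lock failed; back off and retry
        time.sleep(1.0)
    return False  # both locks could not be acquired before the deadline
```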

Why is that happening? I don’t know. My first guess is something with the file system being off so that it can’t write these lock files. But that really is nothing more than a guess. 

As to container scheduling, the Container Service doesn’t really do anything that deserves the name. It just submits container launch jobs as fast as they come in and relies on the compute backend to handle them. If you need to throttle the number of jobs per node, you should be able to configure that on the compute backend itself. You can also configure resource requirements on your commands, which CS will include in the container launch requests it submits to the compute backend; that lets the backend make its own scheduling decisions.

John Flavin

> On Apr 16, 2024, at 11:16 PM, Ahmed Hosny <i.ahme...@gmail.com> wrote:
>
> Thanks so much John for the prompt reply here.