Thanks so much, John, for the prompt reply here.
I checked /monitoring and could not find any signs of OOMing or blocked threads (the threads are all either TIMED_WAITING or WAITING).
I did, however, notice that the queued containers (or worker nodes) do successfully complete and shut down. For these scans, the UI shows "Processing finished. Uploading files." It remains stuck there for some time (longer than usual), after which some succeed while others error out with the following:
2024-04-17 03:44:12,868 [DefaultMessageListenerContainer-6] ERROR org.nrg.containers.services.impl.ContainerFinalizeServiceImpl - Container 638: Could not upload files to resource: Unable to acquire thread and process locks
2024-04-17 03:44:12,879 [DefaultMessageListenerContainer-6] ERROR org.nrg.containers.services.impl.ContainerFinalizeServiceImpl - Cannot upload files for command output nifti:nifti-resource
It is still puzzling to me why the same scan otherwise runs successfully when fewer than ~10 scans are batched together. Could this be a disk I/O bottleneck? For context, this is running on AWS, and the XNAT home directory is on an EFS (NFS) share mounted onto an EC2 instance. I've not had previous issues with a bad mount or anything, and the EFS disk I/O metrics did not look abnormal.
What kind of knobs does XNAT expose to control container scheduling? I looked at the JMS queue and the Docker Swarm constraints, but was hoping to control, e.g., the number of workers per node as a short-term solution. If not, maybe I can invoke this container via the Container Service API and pass in a few scans at a time (rough sketch below)?
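For that last option, something like the sketch below is what I had in mind: launch the wrapper once per scan via the REST API, a few at a time. The launch route, wrapper ID, and input name in it are placeholders/assumptions on my part; I'd pull the real ones from the Swagger docs under /xapi on my server.

import time
import requests

XNAT = "https://your-xnat.example.org"   # placeholder host
AUTH = ("username", "password")          # or a session/token
WRAPPER_ID = 42                          # placeholder: the command wrapper ID
SESSION = "XNAT_E00001"                  # placeholder session accession ID
SCAN_IDS = ["1", "2", "3"]               # fill in the scans to process
BATCH_SIZE = 3                           # launches per batch
PAUSE_SECONDS = 300                      # pause between batches

def launch(scan_id):
    # Assumed launch route and input name; the real path and the wrapper's
    # external input names should come from the /xapi Swagger docs.
    url = f"{XNAT}/xapi/wrappers/{WRAPPER_ID}/launch"
    payload = {"scan": f"/experiments/{SESSION}/scans/{scan_id}"}
    resp = requests.post(url, json=payload, auth=AUTH)
    resp.raise_for_status()
    print(f"Launched scan {scan_id}: {resp.text}")

for start in range(0, len(SCAN_IDS), BATCH_SIZE):
    for scan_id in SCAN_IDS[start:start + BATCH_SIZE]:
        launch(scan_id)
    # Crude throttle; polling container status between batches would be
    # nicer than a fixed sleep.
    time.sleep(PAUSE_SECONDS)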
I am going to test the backend, but it seems that the backend is obeying the XNAT scheduler and not spinning up all the containers at once. The fact that a good amount of memory is still free supports this.
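To test the backend side, I'll probably just count the running tasks per node from the swarm manager, roughly like this (assumes the docker CLI is available on the manager; it counts all service tasks, not only CS containers):

import subprocess

def docker(*args):
    # Thin wrapper around the docker CLI.
    return subprocess.run(["docker", *args], capture_output=True,
                          text=True, check=True).stdout

# One ID per swarm node.
node_ids = docker("node", "ls", "--format", "{{.ID}}").split()

for node_id in node_ids:
    # Count tasks currently in the Running state on this node.
    states = docker("node", "ps", node_id,
                    "--format", "{{.CurrentState}}").splitlines()
    running = sum(1 for s in states if s.startswith("Running"))
    print(f"{node_id}: {running} running task(s)")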
Thanks!