After some fighting, I have successfully built a custom image for my SLURM instances, based on Rocky Linux 8. I have also added to this image a roughly 12 GB Python environment that our jobs use.
Previously, this environment was served by the controller over NFS, which was a clear bottleneck when deploying hundreds of worker instances.
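To put a rough number on that bottleneck, here is a back-of-envelope sketch. Only the 12 GB size comes from my setup; the worker count of 300 and the 10 Gbit/s controller link are assumed values for illustration, not measurements, and the function name is just something I made up. It also assumes the worst case where every worker reads the whole environment at the same time, with no client-side caching and no protocol overhead.

```python
def nfs_fanout_seconds(env_gb: float, workers: int, link_gbit_s: float) -> float:
    """Lower bound on the time for every worker to read the full environment
    through a single shared server link (ignores overhead and caching)."""
    total_gbit = env_gb * 8 * workers  # GB -> Gbit, summed over all workers
    return total_gbit / link_gbit_s


ENV_GB = 12        # environment size from the post
WORKERS = 300      # assumed: "hundreds" of workers
LINK_GBIT_S = 10   # assumed: controller network link speed

seconds = nfs_fanout_seconds(ENV_GB, WORKERS, LINK_GBIT_S)
print(f"{seconds / 60:.0f} minutes")  # prints "48 minutes"
```

Under those assumptions the controller link alone accounts for most of an hour, so the question is whether image distribution scales any better.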
Does anyone have experience with baking data into the image, and does it improve performance? When hundreds of instances boot from the same image, what serves the image, and where is the bottleneck?
I'll give things a try, but I have a few more components to build before I can launch a test.