[slurm-users] Re: Container Jobs "hanging"


Joshua Randall via slurm-users

May 28, 2024, 9:23:12 AM
to slurm...@lists.schedmd.com
Hi Sean,

I appear to be having the same issue you reported: OCI container jobs
that run forever / appear to hang. I haven't figured it out yet, but
perhaps we can compare notes and determine what aspect of configuration
we both share.

Like you, I was following the examples in
https://slurm.schedmd.com/containers.html and originally encountered
the issue with an alpine container image running the `uptime` command,
but I have also confirmed the issue with other images including ubuntu
and with other processes. I always get the same result: the container
process runs to completion and exits, but the Slurm job then continues
to run until it is cancelled or killed.
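
In case it matters, the alpine bundle I am testing with was created
along the lines of the containers.html example -- something like the
following (the paths and bundle name here are just illustrative, not
necessarily what either of us actually used):

```
# Pull the image into an OCI layout, then unpack it into a runtime bundle
# (config.json + rootfs) -- the directory that `srun --container` points at.
cd ~/oci_images/                      # any path the compute nodes can see
skopeo copy docker://alpine:latest oci:alpine:latest
umoci unpack --rootless --image alpine:latest alpine-bundle/
# then e.g.: srun --container ~/oci_images/alpine-bundle uptime
```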

I am running Slurm v23.11.6 with nvidia-container-runtime. What Slurm
version and runtime are you using?

My oci.conf is:
```
$ cat /etc/slurm/oci.conf
EnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)="
RunTimeEnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)="
RunTimeQuery="nvidia-container-runtime --rootless=true
--root=/run/user/%U/ state %n.%u.%j.%s.%t"
RunTimeKill="nvidia-container-runtime --rootless=true
--root=/run/user/%U/ kill -a %n.%u.%j.%s.%t"
RunTimeDelete="nvidia-container-runtime --rootless=true
--root=/run/user/%U/ delete --force %n.%u.%j.%s.%t"
RunTimeRun="nvidia-container-runtime --rootless=true
--root=/run/user/%U/ run %n.%u.%j.%s.%t -b %b"
```
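
One thing I have been doing while a job sits in this state -- purely as
a debugging sketch, so adjust the binary and --root to match your own
oci.conf -- is asking the runtime directly on the compute node whether
it still thinks the container exists:

```
# On the compute node, while the srun job appears hung.
# The --root mirrors the oci.conf above (%U expands to the user ID).
nvidia-container-runtime --rootless=true --root=/run/user/$(id -u)/ list

# Or query a single container by its %n.%u.%j.%s.%t name from oci.conf
# (placeholders below -- substitute your node/user/job/step/task values):
nvidia-container-runtime --rootless=true --root=/run/user/$(id -u)/ state <node>.<user>.<jobid>.<stepid>.<taskid>
```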

Hope that we can get to the bottom of this and resolve our issues with
OCI containers!

Josh.


---
Hello. I am new to this list and Slurm overall. I have a lot of
experience in computer operations, including Kubernetes, but I am
currently exploring Slurm in some depth.

I have set up a small cluster and, in general, have gotten things
working, but when I run a container job, the command runs and then the
job appears to hang, as if the container were still running.

So, running the following works, but it never returns to the prompt
unless I use [Control-C].

$ srun --container /shared_fs/shared/oci_images/alpine uptime
19:21:47 up 20:43, 0 users, load average: 0.03, 0.25, 0.15

I'm unsure if something is misconfigured or if I'm misunderstanding
how this should work, but any help and/or pointers would be greatly
appreciated.

Thanks!
Sean

--
Dr. Joshua C. Randall
Principal Software Engineer
Altos Labs
email: jran...@altoslabs.com

--
slurm-users mailing list -- slurm...@lists.schedmd.com
To unsubscribe send an email to slurm-us...@lists.schedmd.com

Joshua Randall via slurm-users

May 31, 2024, 5:39:45 AM
to slurm...@lists.schedmd.com
Just an update to say that, for me, this issue appears to be specific
to the `runc` runtime (or `nvidia-container-runtime` when it uses
`runc` internally). I switched to `crun` and the problem went away --
containers run with `srun --container` now terminate once the inner
process exits.
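
In case it helps, the working setup is essentially the oci.conf from my
earlier mail with `crun` swapped in for `nvidia-container-runtime` --
roughly the following sketch (double-check the flags against your crun
version and the crun example in containers.html):

```
EnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)="
RunTimeEnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)="
RunTimeQuery="crun --rootless=true --root=/run/user/%U/ state %n.%u.%j.%s.%t"
RunTimeKill="crun --rootless=true --root=/run/user/%U/ kill -a %n.%u.%j.%s.%t"
RunTimeDelete="crun --rootless=true --root=/run/user/%U/ delete --force %n.%u.%j.%s.%t"
RunTimeRun="crun --rootless=true --root=/run/user/%U/ run %n.%u.%j.%s.%t -b %b"
```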

--
Dr. Joshua C. Randall
Principal Software Engineer
Altos Labs
email: jran...@altoslabs.com