Hi Brian,
Thanks for replying. On our hardware, GPUs allocated to a job via cgroups
sometimes get into a state that only a reboot clears.
Outside the job, a simple CUDA program calling the API function
cudaGetDeviceCount works happily. Inside the job, it returns an error code
of 3 (cudaErrorInitializationError).
At present, I have a TaskProlog that prods this API function and emails me
when there is a failure. It'd be nice if the nodes could drain themselves
without administrator intervention, rather than continuing to start queued
jobs that then fail.
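
A minimal sketch of the kind of check described above, not the exact script in
use. It assumes a small probe binary that calls cudaGetDeviceCount and exits
with the CUDA error code (0 on success); the path /usr/local/sbin/gpu_probe
and the function names are placeholders. The notification step is left as a
comment.

```python
#!/usr/bin/env python3
"""Sketch of a TaskProlog GPU check (path and names are illustrative)."""
import subprocess

# Hypothetical path to a small CUDA program that calls cudaGetDeviceCount
# and exits with the CUDA error code (0 on success).
PROBE = "/usr/local/sbin/gpu_probe"


def gpu_healthy(probe_cmd, timeout=30):
    """Run the probe and return (ok, returncode).

    A probe that cannot be started or that times out counts as a failure
    and is reported with returncode None.
    """
    try:
        result = subprocess.run(
            probe_cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False, None
    return result.returncode == 0, result.returncode


def report(probe_cmd):
    """Check the GPU and tell the job about a failure; return True if healthy."""
    ok, code = gpu_healthy(probe_cmd)
    if not ok:
        # Slurm forwards TaskProlog stdout lines that start with "print "
        # to the job's standard output.
        print(f"print GPU check failed (probe exit code {code})")
        # The notification to the administrator (e.g. an email) would go here.
    return ok


if __name__ == "__main__":
    report([PROBE])
```

Since TaskProlog runs as the job's user, a script like this can detect and
report the failure but cannot by itself change the node's state.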
I can see a couple of ways to do it (e.g. a sudo script in TaskProlog, or
playing with the cgroup hierarchy outside of Slurm), but was wondering
whether I had misunderstood the Slurm docs and there was a simpler way.
Best,
Mark