[slurm-users] Automatically cancel jobs not utilizing their GPUs

Stephan Roth

Jul 2, 2020, 2:57:34 AM
to slurm...@lists.schedmd.com
Hi all,

Does anyone have ideas or suggestions on how to automatically cancel
jobs which don't utilize the GPUs allocated to them?

The Slurm version in use is 19.05.

I'm thinking about collecting per-process GPU utilization on all nodes
with NVML/nvidia-smi, maintaining a mean of the collected utilization
per GPU, and cancelling a job if that mean is below a to-be-defined
threshold after a to-be-defined number of minutes.
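
A minimal Python sketch of that idea, under stated assumptions: it samples
per-GPU (not per-process) utilization, assuming the input text comes from
`nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits`.
The names `parse_gpu_utilization`, `IdleGpuTracker` and `scancel_command`
are hypothetical, and mapping a GPU index back to a Slurm job ID is left
to the caller because it is site-specific.

```python
from collections import deque
from statistics import mean


def parse_gpu_utilization(csv_text):
    """Parse lines of the form 'index, utilization' into {gpu_index: percent}.

    Assumes the text was produced by:
    nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits
    """
    result = {}
    for line in csv_text.strip().splitlines():
        index, util = (field.strip() for field in line.split(","))
        result[int(index)] = float(util)
    return result


class IdleGpuTracker:
    """Keep the last `window_samples` readings per GPU and flag low means.

    Hypothetical helper: the threshold and window are the to-be-defined
    values from the message above.
    """

    def __init__(self, threshold_percent, window_samples):
        self.threshold = threshold_percent
        self.window = window_samples
        self.samples = {}  # gpu index -> deque of recent readings

    def add_sample(self, utilization_by_gpu):
        for gpu, util in utilization_by_gpu.items():
            history = self.samples.setdefault(gpu, deque(maxlen=self.window))
            history.append(util)

    def idle_gpus(self):
        """Return GPUs whose mean over a full window is below the threshold."""
        return sorted(
            gpu
            for gpu, history in self.samples.items()
            if len(history) == self.window and mean(history) < self.threshold
        )


def scancel_command(job_id):
    """Build (but do not run) the scancel invocation for a job ID."""
    return ["scancel", str(job_id)]
```

A collector could call `add_sample()` once per minute on each node and pass
the result of `scancel_command()` to `subprocess.run()` for the job that
owns a flagged GPU. A GPU is only reported once a full window of samples
exists, so a job that has just started is not cancelled early.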

Thank you for any input,

Cheers,
Stephan

Steven Dick

Jul 3, 2020, 7:59:45 PM
to Slurm User Community List
I have collectd running on my GPU nodes with the collectd_nvidianvml
plugin from pip.
I have a collectd frontend that displays that data along with Slurm
data for the whole cluster for users to see.
Some of my users watch that carefully and tune their jobs to maximize
utilization.

When I spot jobs that are either not using their GPUs effectively or
don't have them open at all, I email the users.
Most are appreciative, as they didn't know their job wasn't working
correctly. Unapologetic repeat offenders find their jobs converted to
preemptible jobs with a job submit plugin and a change of QOS.

I considered writing something to kill jobs outright when they didn't
use the GPU resources they requested, but with the above approach
I've found it unnecessary.