We have had a similar problem: even with separate partitions for CPU
and GPU nodes, people still submitted jobs to the GPU nodes that we
suspected were CPU-only workloads. It doesn't help to look for a
missing --gres=gpu:x, because a user can request GPUs and simply not
use them. We considered GPU usage checks, but those aren't ideal
either, partly because collecting real GPU usage gets messy (we did it
for a while using NVIDIA's API), and partly because there are
legitimate jobs that need a GPU but not intensively (e.g. some
reinforcement learning experiments).
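
If you do want to sample real usage, NVML is the NVIDIA API we mean;
here is a minimal sketch using the pynvml bindings (a one-shot poll
for illustration, not the monitoring we actually ran):

    import pynvml

    # Sample instantaneous utilization for every GPU on the node.
    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
            # The suspicious case is a GPU that holds compute processes
            # but sits near 0% utilization for long stretches; as noted
            # above, even that can be a legitimate job.
            print(f"GPU {i}: {util.gpu}% util, {len(procs)} compute procs")
    finally:
        pynvml.nvmlShutdown()

You would have to poll this over time and correlate it with the job
that owns the GPU, which is where it gets messy.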
The main currency on our cluster is the fairshare score. We do not use
shares as credit points, but as a resource that gets eroded by
consumption. We assigned TRES billing weights on the GPU nodes such
that allocating one GPU on a four-GPU node automatically charges you
max(N/4, M/4, G/4), where N, M, and G are the node's total cores,
memory, and number of GPUs. To make this work we also set
PriorityFlags=MAX_TRES in slurm.conf.
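
For illustration, here is roughly what that looks like in slurm.conf;
the node shape (32 cores, 256 GB RAM, 4 GPUs) and the partition/node
names are made up, so scale the weights to your own hardware:

    # Hypothetical GPU node: 32 cores, 256 GB RAM, 4 GPUs.
    # Weights are normalized so a full node bills 1.0 in every
    # dimension: CPU = 1/32, Mem = 1/256 per GB, GRES/gpu = 1/4.
    PartitionName=gpu Nodes=gpu[01-08] TRESBillingWeights="CPU=0.03125,Mem=0.00390625G,GRES/gpu=0.25"

    # Bill the single largest weighted TRES instead of their sum:
    PriorityFlags=MAX_TRES

With MAX_TRES, 1 GPU + 8 cores + 64 GB bills 0.25 of the node, while
1 GPU + almost all the RAM bills close to 1.0, which is exactly the
behavior described below.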
Now we don't have to worry about someone taking all the RAM with just
1 CPU and 1 GPU on a node: they "pay" for whichever resource they
consume the most of. We did have a problem where someone would allocate
just 1 GPU, a few CPU cores, and almost all the RAM, effectively
rendering the node useless to others. Now they pay for almost the
entire node if they do that, which is the fairest charge, because
nobody else can use the node.

This works for us also because we use preemption across the cluster
(with a 1h exemption), and jobs get preempted based on job priority.
The more resources anyone consumes, the lower their fairshare score
drops, and with it their job priorities.
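
For completeness, the preemption side is plain slurm.conf; this is a
sketch only, since the exact PreemptType and PreemptMode are site
choices and ours may differ from yours:

    # Allow preemption, but guarantee each job one hour of runtime
    # before it becomes a preemption candidate:
    PreemptType=preempt/qos
    PreemptMode=REQUEUE
    PreemptExemptTime=01:00:00
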
Relu