I am trying to set up the cluster with GPU nodes of type `a2-highgpu-2`. These seem to work a bit differently from attaching GPUs to a standard VM configuration; in particular, I have to set `gpu_count: 0` and `gpu_type: null` in the Terraform configuration. However, when I initialize the cluster like this, I am unable to specify `--gres=gpu:1` or `--gpus=1` for SLURM jobs.

I tried to update the SLURM configuration manually: in `slurm.conf` I added `Gres=gpu:2` to the respective node definitions, and I created a `gres.conf` file containing `NodeName=iol-compute-0-[0-7] Name=gpu File=/dev/nvidia[0-1]`. Based on what I saw in the slurm-gcp `setup.py` and in the SLURM manual, these seem to be the relevant things to modify. After updating the cluster with `scontrol reconfigure`, `sinfo` does show the correct GRES info, and the individual nodes contain the right `gres.conf` as well. Nevertheless, submission still fails with `sbatch: error: Batch job submission failed: Requested node configuration is not available` when I specify either `--gpus=1` or `--gres=gpu:1` for a job.
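For concreteness, the fragments I added look roughly like this (node names and GPU count are from my setup; the other fields of the `NodeName` line in `slurm.conf` are elided):

```
# /usr/local/etc/slurm/slurm.conf -- existing node line, with Gres=gpu:2 appended
NodeName=iol-compute-0-[0-7] ... Gres=gpu:2

# /usr/local/etc/slurm/gres.conf -- new file
NodeName=iol-compute-0-[0-7] Name=gpu File=/dev/nvidia[0-1]
```

A job submitted with e.g. `sbatch --gres=gpu:1 job.sh` then fails with the error quoted above.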
I am also wondering what the appropriate way to update the slurm-gcp configuration is. To modify the files in the `/usr/local/etc/slurm` directory, I had to change the slurm user's password on the controller via sudo. It feels like this is not the "appropriate" way to do it, since slurm-gcp does not know about these changes. I would prefer to make persistent modifications to `setup.py` and rerun the cluster setup, but I am afraid that would reset any changes I have made to the controller (e.g. mounting a separate disk as the home directory).
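Concretely, the manual route I took on the controller was roughly the following (the password step is what feels wrong to me):

```shell
# on the controller node
sudo passwd slurm                          # set a password so I can become the slurm user
su - slurm
vi /usr/local/etc/slurm/slurm.conf         # append Gres=gpu:2 to the a2 node lines
vi /usr/local/etc/slurm/gres.conf          # create the file with the NodeName=... Name=gpu line
scontrol reconfigure                       # push the updated config to the nodes
```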
Thanks in advance for any help!