I am trying to set up the cluster with GPU nodes of type `a2-highgpu-2`. These seem to work a bit differently from attaching GPUs to a standard VM configuration; in particular, I have to set `gpu_count: 0` and `gpu_type: null` in the Terraform configuration. However, when I initialize the cluster like this, I am unable to specify `--gres=gpu:1` or `--gpus=1` for SLURM jobs.

I tried to update the SLURM configuration manually: in `slurm.conf` I added `Gres=gpu:2` to the respective node definitions, and I created a `gres.conf` file containing `NodeName=iol-compute-0-[0-7] Name=gpu File=/dev/nvidia[0-1]`. Based on what I saw in the slurm-gcp `setup.py` and in the SLURM manual, these seem to be the relevant things to modify. After updating the cluster with `scontrol reconfigure`, `sinfo` does show the correct GRES info, and the individual nodes contain the right `gres.conf` as well. Nevertheless, submission still fails with `sbatch: error: Batch job submission failed: Requested node configuration is not available` when I specify either `--gpus=1` or `--gres=gpu:1` for a job.
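For concreteness, the fragments I added look roughly like this (node names and GPU count are from my setup; the other fields of the `NodeName` line in `slurm.conf` are elided):

```
# /usr/local/etc/slurm/slurm.conf -- existing node line, with Gres=gpu:2 appended
NodeName=iol-compute-0-[0-7] ... Gres=gpu:2

# /usr/local/etc/slurm/gres.conf -- new file
NodeName=iol-compute-0-[0-7] Name=gpu File=/dev/nvidia[0-1]
```

A job submitted with e.g. `sbatch --gres=gpu:1 job.sh` then fails with the error quoted above.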
I am also wondering what the appropriate way to update the slurm-gcp configuration is. To modify the files in the `/usr/local/etc/slurm` directory, I had to change the slurm user's password on the controller via sudo. It feels like this is not the "appropriate" way to do it, since slurm-gcp does not know about these changes. I would prefer to make persistent modifications to `setup.py` and rerun the cluster setup, but I am afraid that would reset any changes I have made to the controller (e.g. mounting a separate disk as the home directory).
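Concretely, the manual route I took on the controller was roughly the following (the password step is what feels wrong to me):

```shell
# on the controller node
sudo passwd slurm                          # set a password so I can become the slurm user
su - slurm
vi /usr/local/etc/slurm/slurm.conf         # append Gres=gpu:2 to the a2 node lines
vi /usr/local/etc/slurm/gres.conf          # create the file with the NodeName=... Name=gpu line
scontrol reconfigure                       # push the updated config to the nodes
```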
Thanks in advance for any help!