Special Images on Compute Nodes -> Interactive & Batch Jobs Stall

Sidd Karamcheti

Apr 24, 2022, 8:04:57 PM
to google-cloud-slurm-discuss
Hey folks,

I've followed the SchedMD instructions to set up a five-partition Slurm cluster on GCP. Both my login and controller nodes are running the SchedMD image family:

`projects/schedmd-slurm-public/global/images/family/schedmd-slurm-21-08-6-debian-10`
 

Each compute node in my cluster, meanwhile, is running an image based on the Deep Learning VM images (Debian + NVIDIA drivers + Python).

I'm having the following problem: I can submit jobs via `sbatch` or `srun`, but `sbatch` jobs never return any output, and `srun` jobs hang indefinitely.
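
Concretely, a minimal reproduction looks like this (the partition name and job ID are just illustrative):

```
$ cat hostname.sbatch
#!/bin/bash
#SBATCH --partition=debug          # illustrative partition name
#SBATCH --output=hostname_%j.out
hostname

# Batch case: the job is accepted, but hostname_<jobid>.out never appears.
$ sbatch hostname.sbatch
Submitted batch job 42

# Interactive case: this hangs indefinitely instead of printing a hostname.
$ srun --partition=debug hostname
```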

Upon further inspection, when I issue an `sbatch` command (for a simple `hostname` job), I can see it spin up the appropriate number of machines, which then spin down, and the job requeues (because Slurm never sees the job results).

Similarly, when I run an `srun`, I can manually SSH into the spun-up node, and everything works fine there.
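
In case it helps with diagnosis, these are the sorts of checks I can run while a job is stuck (the slurmd log path is a guess and may differ on these images):

```
# On the controller/login node:
squeue                        # the job bounces back to pending when it requeues
scontrol show job <jobid>     # check Reason= and NodeList=
sinfo -R                      # any nodes marked down/drained, and why

# On the spun-up compute node (which I can reach over ssh):
systemctl status slurmd                 # is slurmd running at all?
tail -n 50 /var/log/slurm/slurmd.log    # log path may differ; look for controller-communication errors
```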

Do I need to run the HPC images on the compute nodes as well? If so, what's the best way to install custom dependencies on my compute nodes (e.g., for GPU/Python)?
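
For reference, the per-node setup I'd otherwise bake into a custom image or startup script is roughly the following (package and version choices are just illustrative):

```
#!/bin/bash
# Illustrative compute-node setup: the real driver/CUDA steps depend on
# which base image these nodes end up running.
set -euo pipefail

apt-get update
apt-get install -y build-essential python3 python3-pip python3-venv

# Python/GPU userspace stack for my jobs (versions illustrative).
python3 -m pip install --upgrade pip
python3 -m pip install torch numpy
```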