[slurm-users] Problem with Cuda program in multi-cluster

mohammed shambakey

Jul 4, 2023, 2:45:02 PM

to Slurm User Community List

I work on 3 clusters: A, B, and C. Clusters A and C each have a head node and 3 compute nodes, one of which has an old GPU. All nodes on all clusters run Ubuntu 22.04, except for the 2 GPU nodes, which run Ubuntu 18.04 to suit the old GPU cards. The installed Slurm version on all clusters is 23.11.0-0rc1.

Cluster B has only a head node and 2 compute nodes. I tried to submit an sbatch script (with a CUDA program) from cluster B to be executed on either cluster A or C (where a GPU node resides). This used to work, but after updating the system, I get the following error:

srun: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by srun)
srun: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by srun)
srun: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by /hpcshared/slurm_vm/usr/lib/slurm/libslurmfull.so)
srun: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /hpcshared/slurm_vm/usr/lib/slurm/libslurmfull.so)
srun: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /hpcshared/slurm_vm/usr/lib/slurm/libslurmfull.so)

The installed glibc is 2.35 on all nodes except the 2 GPU nodes, which have glibc 2.27. When I run the same sbatch script directly on cluster A or C, it works fine. The problem happens only when I use "sbatch -Mall" from cluster B. Just to be sure, I submitted another sbatch script with the multi-cluster option that does NOT involve a CUDA program, and it worked fine.
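To make the mismatch explicit, here is a minimal Python sketch that extracts the GLIBC symbol versions named in the error output above and lists those newer than a node's installed glibc. The helper names (`parse_version`, `required_glibc_versions`, `missing_on_node`) are hypothetical and only for illustration; the log text is copied from the error above.

```python
import re

# Error output copied verbatim from the failing job.
ERROR_LOG = """\
srun: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by srun)
srun: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by srun)
srun: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by /hpcshared/slurm_vm/usr/lib/slurm/libslurmfull.so)
srun: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /hpcshared/slurm_vm/usr/lib/slurm/libslurmfull.so)
srun: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /hpcshared/slurm_vm/usr/lib/slurm/libslurmfull.so)
"""


def parse_version(text):
    """Turn '2.27' into (2, 27) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def required_glibc_versions(log):
    """Return the distinct GLIBC_x.y versions mentioned in the log, sorted."""
    found = set(re.findall(r"GLIBC_(\d+(?:\.\d+)+)", log))
    return sorted(found, key=parse_version)


def missing_on_node(log, installed):
    """List required versions that are newer than the installed glibc."""
    have = parse_version(installed)
    return [v for v in required_glibc_versions(log) if parse_version(v) > have]


if __name__ == "__main__":
    print("GPU node (glibc 2.27):", missing_on_node(ERROR_LOG, "2.27"))
    print("Other nodes (glibc 2.35):", missing_on_node(ERROR_LOG, "2.35"))
```

With glibc 2.27 all three required versions are reported as missing, while with glibc 2.35 none are, which matches the failure appearing only on the GPU nodes.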

Should I install the same glibc version on all nodes (2.32, 2.33, or 2.34), or what?

Regards

Mohammed

Feng Zhang

Jul 5, 2023, 9:35:53 AM

to Slurm User Community List

Mohammed,

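It seems you need to upgrade glibc on the GPU nodes of clusters A and C. The error message says that srun needs newer glibc symbol versions (GLIBC_2.32 through GLIBC_2.34) than the glibc 2.27 installed on those nodes. Or you can downgrade your Slurm build for those nodes (for example, recompile it on a system with glibc 2.27 or older) on clusters A/C.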
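Before changing anything, it may help to confirm which glibc each node actually provides. The following is a rough sketch, assuming Python 3 is available on the nodes; `parse_libc_string` and `local_libc` are hypothetical helper names. It relies on `os.confstr("CS_GNU_LIBC_VERSION")`, which on glibc systems returns a string such as "glibc 2.27".

```python
import os


def parse_libc_string(value):
    """Split a string such as 'glibc 2.27' into ('glibc', (2, 27))."""
    name, _, version = value.partition(" ")
    return name, tuple(int(part) for part in version.split("."))


def local_libc():
    """Return (name, version tuple) for this node's C library, or None."""
    try:
        value = os.confstr("CS_GNU_LIBC_VERSION")
    except (AttributeError, ValueError, OSError):
        # Not a glibc system, or confstr is unavailable on this platform.
        return None
    return parse_libc_string(value) if value else None


if __name__ == "__main__":
    print(os.uname().nodename, local_libc())
```

Running it directly on each node (for example over SSH) should show which nodes fall below the versions that srun asks for.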