[slurm-dev] Re: GPU allocation errors when submitting from outside the cluster


Hagen Kerzmann

Dec 8, 2015, 10:30:57 AM
to slurm-dev
I should also mention that even the PATH hack does not fully solve the problem. I can submit GPU jobs to node-2 without trouble, but for the other node (which is also the one running slurmctld) I get

WARNING (theano.sandbox.cuda): CUDA is installed, but device gpu is not available  (error: cuda unavilable),

which is also the usual error one gets when not allocating a GPU via --gres. Now, this seems to be a different problem, but maybe you have some ideas.
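Since the message matches what appears when no GPU is allocated, one way to narrow this down is to check what the job on that node actually sees. The following is a minimal sketch, not a confirmed diagnosis: `report_gpu_visibility` and the file name `gpucheck.sh` are hypothetical, and it assumes the GRES GPU plugin exports `CUDA_VISIBLE_DEVICES` for jobs that were granted a GPU.

```shell
#!/bin/sh
# Hypothetical helper: print the GPU-related state visible inside the job.
report_gpu_visibility() {
    # Expected to be set by Slurm when a GPU was granted via --gres (assumption).
    echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-<unset>}"
    # List the NVIDIA device files that exist on the node running the job.
    for dev in /dev/nvidia[0-9]*; do
        if [ -e "$dev" ]; then
            echo "device=$dev"
        fi
    done
    return 0
}

report_gpu_visibility
```

Run once per node, for example with `srun --nodelist=devbox1 --gres=gpu:1 sh gpucheck.sh` (node name taken from the gres.conf quoted below). If the variable is unset or the device files are missing on only one node, that would point at the GRES setup or driver state on that node rather than at PATH.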

Thanks in advance,

Hagen

2015-12-08 15:28 GMT+01:00 Hagen Kerzmann <hagenk...@gmail.com>:
Hi,

I have installed slurm-15.08.2 on a very small cluster of two machines, each with 4 NVIDIA GPUs. I want to submit jobs from another machine that has slurm installed but runs no daemons, so it is not part of the cluster. We mainly work with theano, so to test the GPU allocation in the cluster, I run a theano script that does some calculations on the GPU if one is available. This works great for jobs submitted from nodes within the cluster, using

srun --gres=gpu:1 sh theanoscript.sh

Submitting non-GPU jobs from the remote machine also works fine, but when I try to allocate one of the GPUs, theano throws the following error:

ERROR (theano.sandbox.cuda): nvcc compiler not found on $PATH. Check your nvcc installation and try again.
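Because the error is about `$PATH`, it may help to print the PATH that the job itself sees on the compute node. This is a small diagnostic sketch; `report_job_env` and the file name `envcheck.sh` are hypothetical and not part of Slurm or theano.

```shell
#!/bin/sh
# Hypothetical helper: show where the job runs and whether nvcc is on its PATH.
report_job_env() {
    echo "host=$(hostname)"
    echo "PATH=$PATH"
    if command -v nvcc >/dev/null 2>&1; then
        echo "nvcc=$(command -v nvcc)"
    else
        echo "nvcc=not found"
    fi
}

report_job_env
```

Submitting it once from a cluster node and once from the outside machine, e.g. `srun --gres=gpu:1 sh envcheck.sh`, and comparing the two `PATH=` lines would show whether the job inherits its PATH from the machine it was submitted from.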

Of course, the GPU machines in the cluster have CUDA installed, so this error must come from the fact that the submitting machine does not. Therefore, I added the (non-existent) CUDA bin directory to my local PATH, and that actually fixed the problem, but of course that is not a desirable solution.

So, from my observations, srun somehow looks for the CUDA path on the remote (submitting) machine, even though the job has already started executing on one of the cluster nodes. How is that possible, and how can I fix this without the hack mentioned above? Does this occur because I submit the job from outside the cluster, or because the submitting machine does not have CUDA installed?
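A possible explanation is that srun, by default, passes the submitting shell's environment (including PATH) on to the job, so the compute node ends up searching the submitter's PATH. Under that assumption, the PATH can be adjusted on the job side instead of on the submitting machine. The sketch below is only an illustration: `add_cuda_to_path` and the file name `cuda_wrapper.sh` are hypothetical, and `/usr/local/cuda` is an assumed install location that may differ on the GPU nodes.

```shell
#!/bin/sh
# Assumed CUDA location on the compute nodes; override by exporting CUDA_HOME.
CUDA_HOME="${CUDA_HOME:-/usr/local/cuda}"

# Prepend DIR/bin to PATH, but only if it exists on the node running the job.
add_cuda_to_path() {
    if [ -d "$1/bin" ]; then
        PATH="$1/bin:$PATH"
        export PATH
    fi
}

add_cuda_to_path "$CUDA_HOME"

# Hand over to the real command, if one was given as arguments.
if [ "$#" -gt 0 ]; then
    exec "$@"
fi
```

It would be used as `srun --gres=gpu:1 sh cuda_wrapper.sh sh theanoscript.sh`, which assumes both scripts are reachable from the compute node's working directory. The PATH change then happens on the node that actually has CUDA, and the submitting machine's PATH stays untouched.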

My gres.conf file is the same on all machines (also on the one outside the cluster):

# Configure support for Titan GPUs
NodeName=devbox[1-2] Name=gpu Type=titan File=/dev/nvidia0
NodeName=devbox[1-2] Name=gpu Type=titan File=/dev/nvidia1
NodeName=devbox[1-2] Name=gpu Type=titan File=/dev/nvidia2
NodeName=devbox[1-2] Name=gpu Type=titan File=/dev/nvidia3

Best regards,

Hagen

Hagen Kerzmann

Dec 14, 2015, 5:05:02 AM
to slurm-dev