GPU numbering on slurm


John Hewitt

Oct 10, 2018, 3:09:13 PM
to DyNet Users
I'm trying to run DyNet on a slurm cluster for the first time, and I'm running into GPU provisioning issues.

If you ask slurm for a single GPU, as in

srun --gres=gpu:1

slurm will give you an arbitrary GPU on a machine with at least one GPU free, and re-number that GPU so it shows up as GPU 0 inside the job.

So, nvidia-smi gives us something like 

|   0  TITAN V             On   | 00000000:82:00.0 Off |                  N/A |
| 28%   36C    P0    27W / 250W |      0MiB / 12066MiB |      0%      Default |

Even though this machine (I promise) has 4 GPUs. This doesn't play well with DyNet, which attempts to request GPU:0 and then hangs until it is finally killed:

$ python -c "import dynet" --dynet-devices CPU,GPU:0
[dynet] initializing CUDA
[dynet] CUDA driver/runtime versions are 9.0/9.1
[dynet] Request for 1 specific GPU ...
[dynet] Device Number: 0
[dynet]   Device name: TITAN V
[dynet]   Memory Clock Rate (KHz): 850000
[dynet]   Memory Bus Width (bits): 3072
[dynet]   Peak Memory Bandwidth (GB/s): 652.8
[dynet]   Memory Free (GB): 12.1928/12.6528
[dynet]
[dynet] Device(s) selected: 0
Killed

A simple fix, asking DyNet for the GPU index (before re-indexing) that slurm actually assigned me, does not work either, since only GPU 0 is detected inside the job.
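For reference, here is roughly what I run inside the allocation to see what the job is actually given (a minimal sketch; the file name is made up, and it assumes the cluster exports CUDA_VISIBLE_DEVICES for GPU jobs, which is common but site-dependent, and that nvidia-smi is on the PATH):

# gpu_env_check.py (made-up name); run inside the srun allocation
import os
import subprocess

# What slurm told the CUDA runtime it may use (if the site sets this at all)
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>"))
# nvidia-smi -L lists the GPUs this job can actually reach; with slurm's
# device confinement that is typically just the card(s) in the allocation
print(subprocess.check_output(["nvidia-smi", "-L"]).decode())

Whatever shows up there should be all that the CUDA runtime, and hence DyNet, can see.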

I don't believe this is an installation issue, since everything works splendidly if I request all 4 GPUs on the machine.

Happy for any help/suggestions; I'm following up internally on this as well.

-John

Graham Neubig

Oct 10, 2018, 3:14:30 PM
to john.h...@gmail.com, DyNet Users
We use DyNet with slurm all the time, and have never had this problem.
Usually I just use the "--dynet-gpu" flag, which should request the GPU that slurm provisioned.
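Something along these lines, i.e. the script itself doesn't need any device-selection code because DyNet picks up the --dynet-* options from the command line at import time (just a sketch; the file name and the toy computation are made up):

# train_stub.py (made-up name); run inside the allocation as:
#   srun --gres=gpu:1 python train_stub.py --dynet-gpu
# DyNet parses the --dynet-* options from sys.argv when the module is imported.
import dynet as dy

m = dy.ParameterCollection()
pW = m.add_parameters((2, 2))
x = dy.inputVector([1.0, 2.0])
y = dy.parameter(pW) * x   # evaluated on the slurm-provisioned GPU
print(y.value())

If you need a specific device or more than one, the --dynet-devices / --dynet-gpus flags from your first message are the way to spell that out.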

Graham

John Hewitt

Oct 10, 2018, 3:32:48 PM
to DyNet Users
Thanks for the note. That makes sense, and I've done a bit more exploration. As it turns out, even when I request 4 GPUs from slurm, I cannot ask DyNet to use any more than 1 GPU without being killed:

python -c "import dynet" --dynet-gpus 1

works just fine, as does requesting each of the GPUs individually by index. But

python -c "import dynet" --dynet-gpus 2

is killed while attempting to request the 2nd GPU, and the same happens when I request any 2 (or more) GPUs by index with --dynet-devices.
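(For completeness: as far as I understand the Python docs, the programmatic equivalent of these flags is the dynet_config module, roughly as in the sketch below, though I've mostly been passing the flags on the command line as above.)

# programmatic_config.py (made-up name); a sketch only, assuming the
# dynet_config API behaves as described in the Python docs.
import dynet_config
dynet_config.set_gpu()        # ask for a GPU rather than the CPU backend
dynet_config.set(mem=512)     # hypothetical memory setting, in MB
import dynet as dy            # the configuration is consumed at import time

m = dy.ParameterCollection()  # trivial use, just to confirm initialization worked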

Starting to wonder if this is an installation problem on my end. Going to work through that now.

-John

John Hewitt

Oct 13, 2018, 1:16:28 AM
to DyNet Users
Considerably more sleuthing has brought us to the following:

When requesting 4 GPUs from slurm, all is well.

When requesting 1 GPU from slurm, and then attempting to provision 1 GPU from DyNet, the following happens:

$ python -c "import dynet" --dynet-gpu 1
[dynet] initializing CUDA
CUDA failure in cudaGetDeviceCount(&nDevices)
unknown error
terminate called after throwing an instance of 'dynet::cuda_exception'
 what():  cudaGetDeviceCount(&nDevices)
Aborted

This seems to fit the "slurm/DyNet GPU numbering problem" hypothesis, but I'm not quite sure how. I've tried installing DyNet inside a 4-GPU slurm interactive allocation as well as inside a 1-GPU allocation, but the effect is the same: DyNet only works for me when I request all 4 GPUs on a machine, regardless of how many I tell DyNet to use.
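One way to take DyNet out of the picture is to call the same runtime function directly; something like the sketch below (file name made up; it assumes libcudart.so can be resolved by the dynamic loader, e.g. after the cluster's CUDA module is loaded):

# cuda_count_check.py (made-up name); calls the same CUDA runtime function
# that DyNet is dying on, but without DyNet involved.
import ctypes

cudart = ctypes.CDLL("libcudart.so")   # adjust the soname/path for your install
count = ctypes.c_int(0)
err = cudart.cudaGetDeviceCount(ctypes.byref(count))
if err != 0:
    print("cudaGetDeviceCount failed with CUDA error code", err)
else:
    print("CUDA runtime sees", count.value, "device(s)")

If this fails the same way outside DyNet, the problem is with the node or driver rather than the library.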

Posting this in part for posterity/in case I find a fix, but any help would also be appreciated.

-John

Graham Neubig

Oct 13, 2018, 9:29:42 AM
to John Hewitt, DyNet Users
OK, it might be better to post this as an issue on GitHub to make sure
it doesn't disappear into the ether.
We use slurm on our cluster as well, and I've never had a problem with
this, so unless there's some way to reproduce the problem I'm not sure
what more I can do.

Graham

John Hewitt

Oct 23, 2018, 10:47:01 PM
to DyNet Users
I haven't moved this over to a GitHub issue, because some grepping around revealed that all the GPUs on one slurm-managed machine were down but the node was still accepting jobs, which is what caused the crashes. The differing GPU requests just caused slurm to route my jobs to different machines, so the 4-GPU runs happened to land on a healthy node...

tl;dr: problem solved; the GPUs were broken, not DyNet. Feeling a bit silly! :)

-John

Graham Neubig

Oct 23, 2018, 11:26:05 PM
to John Hewitt, DyNet Users
Ok, glad it got resolved!

Graham
