Hi,
We recently updated the NVIDIA driver and CUDA installation on our machines, and we're no longer able to properly use multiple GPUs on the same machine. This is likely not a torch issue, but I'm hoping someone might have seen a similar problem and has an idea as to what's going on.
Here's the problematic code:
```lua
local nn = require 'nn'
local cudnn = require 'cudnn'
local cutorch = require 'cutorch'

c = nn.Linear(1, 1):cuda()
c1 = c:clone() -- succeeds
cutorch.setDevice(2)
print('cloning')
c2 = c:clone() -- freezes
print('cloned') -- never prints
```
Essentially, trying to clone a model created on one GPU while a different GPU is active hangs. I tried debugging a little, and it seems to hang at this line:
https://github.com/torch/torch7/blob/1fe19f2f7a054fd3935d6b19d092e99535483042/File.lua#L351. I tried stepping inside that call using mobdebug, but that also froze for some reason.
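Since the hang happens during a cross-device serialization/copy, it may be reproducible below the framework level entirely. Here's a minimal CUDA C sketch (an assumption on my part, not a confirmed repro) that exercises a raw device-to-device copy between GPU 0 and GPU 1; if our setup is the problem, I'd expect this to hang at `cudaMemcpyPeer` the same way torch does:

```cuda
// p2p_test.cu -- minimal cross-GPU copy check.
// Compile with: nvcc p2p_test.cu -o p2p_test
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int n = 0;
    cudaGetDeviceCount(&n);
    if (n < 2) { printf("need >= 2 GPUs\n"); return 1; }

    // Report whether the driver thinks peer access is possible at all.
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    printf("peer access 0->1: %d, 1->0: %d\n", can01, can10);

    float *a = NULL, *b = NULL;
    cudaSetDevice(0);
    cudaMalloc((void**)&a, 1 << 20);
    cudaSetDevice(1);
    cudaMalloc((void**)&b, 1 << 20);

    // On a misconfigured machine this copy is a plausible hang point.
    printf("copying device 0 -> device 1...\n");
    cudaMemcpyPeer(b, 1, a, 0, 1 << 20);
    cudaDeviceSynchronize();
    printf("copy finished\n");
    return 0;
}
```

If this C program hangs too, that would confirm the problem is in the driver/hardware configuration rather than in torch or TensorFlow.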
We're also seeing a similar freeze with multi-GPU TensorFlow training on the same machine.
Hardware/software:
- CUDA version: 8.0
- cuDNN version: 5.1.5
- GPUs: Titan X (Pascal)
- Updated cutorch and cunn earlier today.
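One thing we haven't ruled out yet: on Pascal cards, hangs in peer-to-peer GPU copies are often reported when the IOMMU is enabled, and NVIDIA's docs recommend disabling it for P2P. A few diagnostic commands we plan to run (hardware-dependent output, so just a sketch):

```shell
# Check whether the IOMMU is enabled on the kernel command line
grep -o 'iommu=[^ ]*' /proc/cmdline

# Look for IOMMU/DMAR activity in the kernel log (may need root)
dmesg | grep -i -e DMAR -e IOMMU | head

# Show the GPU interconnect topology as the driver sees it
nvidia-smi topo -m
```

If the IOMMU turns out to be enabled, booting with it disabled (e.g. via the kernel command line) would be the first thing to try.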
Again, I'm guessing this isn't a torch bug (or if it is, it's a bug shared by both torch and TensorFlow), but rather something wrong with our setup. Any hints or suggestions would be appreciated!