Torch hangs on call to clone from another GPU


Achal Dave

Dec 29, 2016, 3:35:34 PM
to torch7
Hi,

We recently updated the NVIDIA drivers and CUDA installation on our machines, and we can no longer use multiple GPUs on the same machine. This is likely not a Torch issue, but I'm hoping someone has seen a similar problem and has an idea of what the cause is.

Here's the problematic code:

local nn = require 'nn'
local cudnn = require 'cudnn'
local cutorch = require 'cutorch'

c = nn.Linear(1, 1):cuda()

c1 = c:clone() -- succeeds

cutorch.setDevice(2)
print('cloning')
c2 = c:clone() -- freezes
print('cloned') -- never prints

Essentially, trying to clone a model created on one GPU onto another GPU hangs. I tried debugging a little, and it seems to hang at this line: https://github.com/torch/torch7/blob/1fe19f2f7a054fd3935d6b19d092e99535483042/File.lua#L351. I tried stepping into that call using mobdebug, but that also froze for some reason.
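
One way to narrow this down is to take nn and clone() out of the picture and attempt a raw tensor copy between the two devices. The following is only a sketch, assuming at least two GPUs are visible to cutorch; the helper name crossDeviceCopy is made up for illustration and is not part of cutorch.

```lua
-- Sketch: copy a tensor from one GPU to another without nn or clone().
-- Assumption: device indices 1 and 2 exist, as in the snippet from the post.
require 'torch'
local cutorch = require 'cutorch'

-- Hypothetical helper (not a cutorch function): allocates a tensor on srcDev,
-- copies it into a tensor on dstDev, and returns the sum of the copy.
local function crossDeviceCopy(srcDev, dstDev)
  local prevDev = cutorch.getDevice()        -- remember the current device

  cutorch.setDevice(srcDev)
  local src = torch.CudaTensor(4):fill(1)    -- allocated on srcDev

  cutorch.setDevice(dstDev)
  local dst = torch.CudaTensor(4)            -- allocated on dstDev
  dst:copy(src)                              -- cross-device copy
  cutorch.synchronize()                      -- wait for the copy to finish

  local total = dst:sum()
  cutorch.setDevice(prevDev)                 -- restore the original device
  return total
end

if cutorch.getDeviceCount() >= 2 then
  print('copying')
  print('sum on device 2: ' .. crossDeviceCopy(1, 2))
else
  print('fewer than two GPUs visible; nothing to test')
end
```

If this also freezes after printing 'copying', the problem is more likely in the driver or CUDA setup than in Torch's serialization code.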

We're also seeing a similar issue with TensorFlow on the same machine, where multi-GPU training is freezing.

Hardware/software:
- CUDA version: 8.0
- cuDNN version: 5.1.5
- GPUs: Titan X (Pascal)
- Updated cutorch and cunn earlier today.

Again, I'm guessing this isn't a torch bug (or if it is, it's a bug that both torch and tensorflow have), but rather something wrong with our setup. Any hints or suggestions would be great!

Vislab

Dec 29, 2016, 5:45:44 PM
to torch7
Are you using the latest NVIDIA drivers? If so, have you tried downgrading to a previous version to see if it works?

Ashton Fagg

Dec 29, 2016, 6:43:35 PM
to torch7 on behalf of Achal Dave
What OS are you running, and what is the NVIDIA driver version?
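
For anyone gathering the same details, here is a rough Lua sketch that prints the kernel release and the loaded NVIDIA driver version. It assumes a Linux machine where `uname` is on the PATH and the NVIDIA kernel module exposes /proc/driver/nvidia/version; the helper names readAll and run are made up for illustration.

```lua
-- Sketch: print the kernel release and NVIDIA driver version on Linux.
-- Assumptions: `uname` is available and the NVIDIA kernel module, if loaded,
-- exposes /proc/driver/nvidia/version.

-- Hypothetical helper: returns a file's contents, or nil if it cannot be opened.
local function readAll(path)
  local f = io.open(path, 'r')
  if not f then return nil end
  local s = f:read('*a')
  f:close()
  return s
end

-- Hypothetical helper: runs a shell command and returns its standard output.
local function run(cmd)
  local p = io.popen(cmd)
  if not p then return nil end
  local s = p:read('*a')
  p:close()
  return s
end

print('kernel: ' .. (run('uname -sr') or 'unknown'))
print('driver: ' .. (readAll('/proc/driver/nvidia/version')
                     or 'NVIDIA kernel module not loaded'))
```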



Achal Dave

Dec 29, 2016, 7:02:15 PM
to torch7 on behalf of Ashton Fagg
Argh, of course, updating to the very latest driver fixed the issue.

We were originally on driver version 367.35, where we were able to use multi-GPU. For some reason, after we updated CUDA and cuDNN, multi-GPU no longer worked with this version. I tried updating to 367.57 (which worked on another machine), but multi-GPU did not work with that version either. Finally, updating to 375.26 fixed the issue.

It's still strange to me that an older driver would break in this way, but at least things are working now. Thanks!


Rohit Girdhar

Jan 22, 2017, 2:45:46 PM
to torch7
Updating this post for completeness' sake (I'm another user on the OP's machines):
The driver update itself didn't quite fix the problem, but a Linux kernel downgrade (to 2.6.32, from 3.1) fixed it for good.

