Kernel Fault on Torch (from git) with CUDA version 7.5.18, NVIDIA driver 352.68 (and 352.39)


Binesh Bannerjee

Dec 16, 2015, 12:47:40 PM
to torch7
Hi. Actually, my woes started before this, but this is how I replicated the problem.

I'm using Ubuntu 14.04.3, with CUDA version 7.5.18, and this happens on the NVIDIA driver 352.39 (which is distributed with CUDA) _and_ NVIDIA driver 352.68.

Basically, I was happily using Torch for a few months. (On a long running problem, so I didn't reboot for a while either.)

Then I rebooted the box, and Torch started behaving oddly. Looking into it, these are the _symptoms_ so far:

    th
    require "cutorch"
    [CTRL-D]

And then it just sort of hangs. When I looked into kern.log, it says: "BUG: unable to handle kernel NULL pointer dereference at 0000000000000020"

and then a whole register dump, etc. follows, which I've attached.
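
For what it's worth, the hang can be reproduced non-interactively with a one-liner (assuming your th is the standard trepl, whose -e flag executes a string and then exits):

    # requires cutorch and exits immediately; the hang / kernel BUG happens at exit
    th -e 'require "cutorch"'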

Is anyone else having this issue? I've reinstalled Ubuntu Server 14.04.3, OpenBLAS, CUDA and torch, and the problem persists.

Interestingly, the homework from my Coursera GPU class runs without error... I'll try to narrow it down beyond the Torch call, or at least find the minimal _Lua_ code that triggers the bug, but any help would be greatly appreciated!

Thanks!
Binesh Bannerjee
kern.log

Binesh Bannerjee

Dec 17, 2015, 3:06:46 AM
to torch7 on behalf of Binesh Bannerjee
So... I just rolled back to CUDA 7.0.28 with NVIDIA driver 346.46, and I get the same problem.

I just realized that I didn't specify my hardware configuration, so here's my nvidia-smi -L output:

    GPU 0: Quadro M6000 (UUID: GPU-09446504-6a9e-866a-a65d-0f1d55b7657b)
    GPU 1: Tesla K40c (UUID: GPU-e992022a-724f-8f47-e08f-a954053020e6)
    GPU 2: Tesla K40c (UUID: GPU-4d14695e-3e43-bf43-a3e3-91190f696d39)

(I also attached the kern.log from after the rollback... As you can see, it's very similar to the one from the newer driver... So I don't _think_ it's the driver... Maybe it's hardware? Maybe I'll start ripping out cards one by one to see if it goes away.)

I tried putting debug fprintf's inside all the C functions within extras/cutorch, but nothing gets printed when I do that. I'll keep digging, but I'd really appreciate some advice/insight.

Thanks,
Binesh Bannerjee

kern.log

Binesh Bannerjee

Dec 17, 2015, 7:49:19 PM
to torch7
OK, so I went ahead and started ripping out cards one by one.
It seems Linux needs at least one card with a video output to
function: when I put in _just_ the Tesla K40s, the box no longer
responded to ssh. So, unfortunately, these were the only tests
I could do:

    M6000, <empty>, <empty>    Torch works fine
    M6000, Tesla A, <empty>    Torch works fine
    M6000, Tesla B, <empty>    Torch works fine
    M6000, Tesla B, Tesla A    kernel faults.

So, it would _seem_ that having both Teslas in simultaneously
is what causes it to fail. Unfortunately, I suspect this puts me in a
rather small group. Can someone please, please, please help me figure out
what's happening?

In the meantime, I'll start digging into figuring out _exactly_ what
cutorch is doing right before the kernel fault.

Thanks,
Binesh Bannerjee

soumith

Dec 17, 2015, 8:00:33 PM
to torch7 on behalf of Binesh Bannerjee
Binesh, this seems like an annoying driver bug of some sort. 

Did you run the multi-GPU CUDA examples that are shipped with CUDA? Do they also have the same issue, or is it only an issue with cutorch...?


Binesh Bannerjee

Dec 17, 2015, 8:56:59 PM
to torch7
Hi Soumith! THANK YOU for your response! I've been feeling a bit lost..

So, no, I did not try the examples shipped with CUDA... But, I just did and:

    root@gpu:~/NVIDIA_CUDA-7.5_Samples/0_Simple/simpleMultiGPU# ./simpleMultiGPU
    Starting simpleMultiGPU
    CUDA-capable device count: 3
    Generating input data...

    Computing with 3 GPUs...
      GPU Processing time: 10.178000 (ms)

    Computing with Host CPU...

    Comparing GPU and Host CPU results...
      GPU sum: 16777280.000000
      CPU sum: 16777294.395033
      Relative difference: 8.580068E-07

    root@gpu:~/NVIDIA_CUDA-7.5_Samples/0_Simple/simpleMultiGPU#

So, it looks like that compiles and runs fine. But I don't really even have to do
anything with torch; I just run

    th
    require "cutorch"
    [CTRL-D]

(and say yes) and then I get the kernel fault.

Does that narrow anything down for you?

THANKS AGAIN!
Binesh Bannerjee



Binesh Bannerjee

Dec 17, 2015, 9:15:38 PM
to torch7
Maybe what I'll do (after I get back from Star Wars, hehe...) is reinstall Ubuntu but _not_ upgrade the kernel, reinstall torch, CUDA, etc., and see whether it's some combination of the kernel and the NVIDIA driver that causes it. (Only because it was working for at least 2 months and only stopped last week when I rebooted the box...)

Binesh

Binesh Bannerjee

Dec 18, 2015, 7:42:16 AM
to torch7
OK, so I just reinstalled Ubuntu Server 14.04.3 and did _not_ update the kernel.
Before, this was my configuration:

    Failed:
        Linux gpu 3.19.0-41-generic #46~14.04.2-Ubuntu SMP Tue Dec 8 17:46:10 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux NVIDIA 352.39 CUDA 7.5.18

Now, I have:
    Success:
        Linux gpu 3.19.0-25-generic #26~14.04.1-Ubuntu SMP Fri Jul 24 21:16:20 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux NVIDIA 352.39 CUDA 7.5.18


Now, let's try updating Ubuntu and see if I get the kernel faults again. Basically, all that
updating means, as far as I'm concerned, is this:
    /usr/bin/apt-get update
    /usr/bin/apt-get dist-upgrade
    /usr/bin/apt-file update
    /usr/bin/apt-get autoclean
    /usr/bin/apt-get autoremove
    /etc/update-motd.d/98-reboot-required
    /bin/date

So, this seems to require re-running cuda_7.5.18_linux.run

And, BAM! kernel faults again!

        Linux gpu 3.19.0-41-generic #46~14.04.2-Ubuntu SMP Tue Dec 8 17:46:10 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux NVIDIA 352.39 CUDA 7.5.18


PHEW. I thought I was going crazy... OK, so somewhere between kernel 3.19.0-25 and kernel 3.19.0-41
is the source of all my woes. *THAT* must be why it was working for so long and only stopped after I rebooted: the newer kernel had been installed by the update, but it didn't load until the reboot.

I suppose I could narrow down the kernel version further if someone could tell me how to install specific kernel versions on Ubuntu. (And ideally, how to roll back too...)

I'm willing to do as much as I can to help figure this problem out, but I need some guidance.

Thanks!
Binesh Bannerjee

Binesh Bannerjee

Dec 18, 2015, 9:23:13 AM
to torch7
OK, I think I figured out how to rollback and install particular versions.

Basically,
    apt-get install "linux-headers-3.19.0-${VERSION}" "linux-headers-3.19.0-${VERSION}-generic" "linux-image-3.19.0-${VERSION}-generic" "linux-image-extra-3.19.0-${VERSION}-generic"

or

    apt-get purge "linux-headers-3.19.0-${VERSION}" "linux-headers-3.19.0-${VERSION}-generic" "linux-image-3.19.0-${VERSION}-generic" "linux-image-extra-3.19.0-${VERSION}-generic"


So, I did that, and basically did a binary search between 3.19.0-25 (known to work) and
3.19.0-41 (known to kernel fault)...

So, what it seems to come down to, is

    3.19.0-33 works
    3.19.0-34 doesn't seem to exist.
    3.19.0-35 doesn't seem to exist.
    3.19.0-36 doesn't seem to exist.
    3.19.0-37 kernel faults.

So, it _seems_ the problem is between kernel 3.19.0-33 and kernel 3.19.0-37.

I'm not sure if that helps.
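
For anyone repeating this bisection: to see which 3.19.0 kernel packages the Ubuntu repositories actually carry (which should explain why -34/-35/-36 "don't seem to exist"), something like this should do it:

    # list the 3.19.0 kernel image packages available from the configured repos
    apt-cache search linux-image-3.19.0 | sort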

Binesh Bannerjee

soumith

Dec 18, 2015, 11:51:55 AM
to torch7 on behalf of Binesh Bannerjee
That's some painful investigation :)
Glad it's over.

Binesh Bannerjee

Dec 18, 2015, 6:02:54 PM
to torch7

Hi Soumith,
    Well... but is it really over? I'd like to be able to update my Ubuntu box
as I normally do, and right now I have to keep reverting to an older kernel.
I _am_ going to try Ubuntu 15.10, just to see if it fixes things; if it does,
I have 4 months to get to another LTS version. Do you have any advice on digging
further? I'm stymied by the fact that I put fprintf(stderr) calls immediately after
_every_ function (in every C file) in the cutorch directory, and when I hit CTRL-D it
doesn't print _anything_ at all, even though require "cutorch" does print a bunch of
things. It would _seem_ to be a __gc function of some kind. Do you know anything
about which functions might get called at exit?

I wonder: is this useful to anyone other than myself? What Linux distro does
Facebook run? Am I wasting my time digging into where this kernel fault is happening?
I'd much rather be playing with neural networks... Maybe I should switch to the distro
most other people are using. What is Facebook (and everyone else) using? Is my choice
of Ubuntu a noob thing?

Binesh Bannerjee


Vladimir Kadlec

Dec 23, 2015, 3:46:51 AM
to torch7
Hi,
  thanks a lot for this post. We have exactly the same issue:
  • 3x GeForce GTX TITAN Black
  • Clean installation of Ubuntu 14.04.3 LTS.
  • With the 3.19.0-42-generic kernel, Torch applications that use CUDA hang at exit inside the kernel (zombie process). The reboot command always hangs; a hard reset is required. We tried different CUDA versions (6.5, 7.0, 7.5) and different NVIDIA drivers (from the Ubuntu repo, from the NVIDIA website), all resulting in the same problem.
  • With 3.19.0-33-generic, no problems at all, NVIDIA driver 352.68 from the Ubuntu repos, CUDA 7.0 and 7.5.
I guess it's an NVIDIA kernel driver issue. I'm able to replicate the problem if someone is interested in logs/debugging.

Binesh Bannerjee

Dec 27, 2015, 6:24:29 PM
to torch7
Hi Vladimir... OK good... (Well.. Not good, but, you know what I mean.)

Anyway, I've filed a bug report with NVIDIA, but I don't know how someone else can verify it. I'm also sending them a link to this thread, so hopefully they can see that at least two people are seeing this problem... Maybe you could send them a bug report too? I think you go here to submit one: https://developer.nvidia.com/nvbugs/cuda/add. My bug number, though, is 1713737; maybe you could reference it so they know we're having similar issues?

They sent me this tho: "Could you please help to provide us a copy of nvidia-bug-report.log.gz log file? It could be generated by running driver utility nvidia-bug-report.sh under root user. Thanks."

Apparently you file your bug report, and then you send an email with the nvidia-bug-report.log.gz files attached to CUDAIssues at nvidia.com, referencing that bug ID number. I sent them one with 3.19.0-33, one from _before_ I ran torch with 3.19.0-42, and one from _after_ I ran torch with 3.19.0-42... Running nvidia-bug-report.sh _after_ you run torch, though, hangs (on my system) at "cat /proc/driver/nvidia/./gpus/0000:05:00.0/information"... (Running that on the command line hangs as well.)
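
(For anyone else sending NVIDIA logs, generating the file is just this, run as root; it should drop the archive in the current directory:)

    sudo nvidia-bug-report.sh    # produces nvidia-bug-report.log.gz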

So.. Aside from all that, Happy New Year!

Thanks,
Binesh

Binesh Bannerjee

Jan 4, 2016, 4:12:03 AM
to torch7
Hi... So, for what it's worth, I sent this to NVIDIA, and my bug number is 1713737. At first they weren't able to reproduce the problem, but eventually they were, so they have people looking into it. I asked if I could post their email to me here, and they said yes, so here it is:



Hi Binesh,
 
We have finally reproduced this issue on another test system, a TYAN-B7079 platform configured with a Quadro M6000 + 2 Tesla K80 GPUs.
[Attached the observed results as following for your reference].
 
And now, we have assigned this issue to the appropriate developer team for further investigation. We’ll keep you posted on the bug report once we have a fix. Thanks.
Sorry for any inconvenience brought by this problem.
 
====
[ 2065.284696] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
[ 2065.284762] IP: [<ffffffffc101452d>] _nv002814rm+0x51d/0x610 [nvidia]
[ 2065.284763] PGD 0
[ 2065.284765] Oops: 0000 [#1] SMP
[ 2065.284797] Modules linked in: nvidia_uvm(POE) nvidia(POE) arc4 md4 nls_utf8 cifs fscache rfcomm bnep bluetooth snd_hda_codec_hdmi intel_rapl snd_hda_intel snd_hda_controller iosf_mbi snd_hda_codec ast joydev ttm snd_hwdep x86_pkg_temp_thermal drm_kms_helper intel_powerclamp drm snd_pcm snd_seq_midi coretemp kvm_intel snd_seq_midi_event syscopyarea kvm snd_rawmidi sysfillrect snd_seq sysimgblt snd_seq_device snd_timer crct10dif_pclmul snd crc32_pclmul soundcore aesni_intelaes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd shpchp mei_me ipmi_ssifsb_edac mei edac_core acpi_power_meter lpc_ich 8250_fintek ipmi_si wmi acpi_pad  ipmi_msghandler mac_hid parport_pc ppdev lp parport hid_generic igb i2c_algo_bit dca usbhid ptp hid pata_acpi pps_core [last unloaded: nvidia]
[ 2065.284801] CPU: 28 PID: 9672 Comm: luajit Tainted: POE  3.19.0-42 -generic #48~14.04.1-Ubuntu
[ 2065.284801] Hardware name: empty empty/FT77C-B7079, BIOS V1.03.B10 05/21/2015
[ 2065.284802] task: ffff880068a73ae0 ti: ffff8806ec6c8000 task.ti: ffff8806ec6c8000
[ 2065.284844] RIP: 0010:[<ffffffffc101452d>]  [<ffffffffc101452d>] _nv002814rm+0x51d/0x610 [nvidia]
[ 2065.284845] RSP: 0018:ffff8806ec6cba58  EFLAGS: 00010246
[ 2065.284846] RAX: 0000000000000000 RBX: ffff88078ceac008 RCX: 0000000000000000
[ 2065.284846] RDX: 0000000000000000 RSI: 0000000000000011 RDI: 0000000000000000
[ 2065.284847] RBP: ffff8807aff7af58 R08: ffff880783051a30 R09: ffff8810583d6700
[ 2065.284847] R10: 00000000568240a4 R11: ffffffffc1430db0 R12: 0000000000000000
[ 2065.284848] R13: 0000000000000001 R14: 0000000000000001 R15: ffff880858937c08
[ 2065.284849] FS:  00007f1f2bbff700(0000) GS:ffff88085fbc0000(0000) knlGS:0000000000000000
[ 2065.284849] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2065.284850] CR2: 0000000000000020 CR3: 0000000001c16000 CR4: 00000000001407e0
[ 2065.284851] Stack:
[ 2065.284852]  ffff88078ceac008 ffff88078328d008 ffff88085b5df808 0000000000000000
[ 2065.284853]  ffff88078460c008 ffffffffc100e540 ffff88078328d008 ffff88078ceac008
[ 2065.284855]  0000000000001100 0000000000000000 0000000000000024 ffffffffc1005291
[ 2065.284855] Call Trace:
[ 2065.284899]  [<ffffffffc100e540>] ? _nv003016rm+0xf0/0x1c0 [nvidia]
[ 2065.284938]  [<ffffffffc1005291>] ? _nv003007rm+0x11/0x50 [nvidia]
[ 2065.285029]  [<ffffffffc122a090>] ? _nv002020rm+0x2680/0x3c80 [nvidia]
[ 2065.285071]  [<ffffffffc147b869>] ? _nv000654rm+0x2b9/0x340 [nvidia]
[ 2065.285111]  [<ffffffffc1471e3a>] ? rm_disable_adapter+0x6a/0x130 [nvidia]
[ 2065.285152]  [<ffffffffc148dff6>] ? nv_uvm_notify_stop_device+0x46/0x70 [nvidia]
[ 2065.285193]  [<ffffffffc1482326>] ? nvidia_close+0x1a6/0x410 [nvidia]
[ 2065.285234]  [<ffffffffc147fc7d>] ? nvidia_frontend_close+0x4d/0xa0 [nvidia]
[ 2065.285238]  [<ffffffff811ee117>] ? __fput+0xe7/0x220
[ 2065.285240]  [<ffffffff811ee29e>] ? ____fput+0xe/0x10
[ 2065.285243]  [<ffffffff81091eac>] ? task_work_run+0xac/0xd0
[ 2065.285245]  [<ffffffff810773f0>] ? do_exit+0x2c0/0xb00
[ 2065.285247]  [<ffffffff81080ebf>] ? recalc_sigpending+0x1f/0x60
[ 2065.285248]  [<ffffffff81077cbf>] ? do_group_exit+0x3f/0xa0
[ 2065.285250]  [<ffffffff81083c30>] ? get_signal+0x1e0/0x710
[ 2065.285253]  [<ffffffff81014e70>] ? do_signal+0x20/0x120
[ 2065.285294]  [<ffffffffc147fcff>] ? nvidia_frontend_ioctl+0x2f/0x70 [nvidia]
[ 2065.285296]  [<ffffffff811ffd28>] ? do_vfs_ioctl+0x2f8/0x510
[ 2065.285299]  [<ffffffff810f10c1>] ? SyS_futex+0x71/0x150
[ 2065.285301]  [<ffffffff81014fd9>] ? do_notify_resume+0x69/0xb0
[ 2065.285305]  [<ffffffff817b79af>] ? int_signal+0x12/0x17
[ 2065.285317] Code: 16 75 00 31 c9 44 89 f2 be 2c 00 00 00 48 89 c7 ff 50 20 48 85 c0 49 89 c4 0f 84 d6 00 00 00 31 c9 31 d2 be 11 00 00 00 4c 89 e7 <41> ff 54 24 20 be 30 00 00 00 48 8b b8 a8 05 00 00 48 89 c3 ff
[ 2065.285358] RIP  [<ffffffffc101452d>] _nv002814rm+0x51d/0x610 [nvidia]
[ 2065.285359]  RSP <ffff8806ec6cba58>
[ 2065.285360] CR2: 0000000000000020
[ 2065.292853] ---[ end trace ff3f27fe0d1c23e2 ]---
[ 2065.292854] Fixing recursive fault but reboot is needed!
 
===
 
Thanks,
Kevin

Gijs Molenaar

Jan 22, 2016, 5:09:42 AM
to torch7
Hi!

Sorry for breaking into your mailing list, but we have exactly the same issue with a 4-GPU system and the same Ubuntu/kernel/driver versions.

Did you manage to solve the problem? Did NVIDIA give any further response? Is bug report 1713737 public somewhere? I was unable to find it.

greetings,

 - Gijs


Binesh Bannerjee

Jan 22, 2016, 8:46:40 AM
to torch7
Hi Gijs... Unfortunately, no. I will ask NVIDIA for an update, but hopefully you can file another bug report and reference the same bug number. Apparently it's related to particular motherboards... For now I'm unfortunately stuck with doing this for my update process:

#! /bin/bash
/usr/bin/apt-get update
/usr/bin/apt-get dist-upgrade
/usr/bin/apt-get install \
    "linux-generic-lts-vivid" \
    "linux-headers-generic-lts-vivid" \
    "linux-image-generic-lts-vivid"

for i in 41 42 43 47
do
    REMOVETHESE="${REMOVETHESE} linux-headers-3.19.0-${i} linux-headers-3.19.0-${i}-generic linux-image-3.19.0-${i}-generic linux-image-extra-3.19.0-${i}-generic"
done
/usr/bin/apt-get purge ${REMOVETHESE}

/usr/bin/apt-file update
/usr/bin/apt-get autoclean
/usr/bin/apt-get autoremove
/etc/update-motd.d/98-reboot-required
/bin/date


It's of course a purely temporary "solution", but maybe it'll be helpful to you... Every time Ubuntu bumps the kernel version, I hope the bug has been fixed, but it hasn't been yet, so I just add the latest number to the "for i" loop...

Binesh

Gijs Molenaar

Jan 25, 2016, 3:14:25 AM
to torch7
Hi Binesh,

You are pinning the kernel to 3.19.0-33-generic, right? That works for us also. Did you disable the automatic kernel updates by removing the linux-generic meta package(s)?

greetings,

 - Gijs



Binesh Bannerjee

Jan 25, 2016, 10:49:19 AM
to torch7
Hi Gijs, do you mean the linux-generic-lts-vivid package?

Yeah, if you remove that (the meta package) and install 3.19.0-33-generic, then it'll stay at -33... But I wanted to be able to keep testing whether any new kernels would fix the problem... So whenever I saw a new kernel release, I updated and checked it, and if it didn't fix things, I just removed that _specific_ version (which, if it's the latest, also removes the meta package...).

But you're right: if you just remove the meta package, you don't have to do all of what I wrote above.
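
Another way to pin things, which I haven't tried myself (just a sketch), would be to hold the meta packages instead of purging specific kernel versions:

    # keep dist-upgrade from pulling in newer kernels via the meta packages
    apt-mark hold linux-generic-lts-vivid linux-image-generic-lts-vivid linux-headers-generic-lts-vivid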

Binesh

Vasili Ramanishka

Feb 29, 2016, 1:41:35 PM
to torch7
Hi!

As a workaround, try using CUDA_VISIBLE_DEVICES=0 (0 is the GPU number) to keep torch from its erroneous resource allocation across several GPUs at once, even when it isn't going to use all of them.
Of course, this only works if your code uses one GPU at a time.
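
For example (train.lua here is just a placeholder for whatever script you run):

    # expose only GPU 0 to this single invocation
    CUDA_VISIBLE_DEVICES=0 th train.lua

    # or export it for the whole shell session
    export CUDA_VISIBLE_DEVICES=0
    th train.lua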

Binesh Bannerjee

Feb 29, 2016, 1:45:33 PM
to torch7 on behalf of Vasili Ramanishka
I should have posted this before, but I got this from someone on my bug report:

Last comment from NVIDIA (Ryan L. - 02/14/2016 1:37 AM):
Hi Binesh,
The bug is fixed in our development driver. We'll inform you once the driver is official released.
Thanks,
Ryan

Of course, it's been two weeks, and I still don't see a new cuda library... So...

But, thanks Vasili, I'll give that a shot... I'm tired of being stuck at the same linux version all this time.

Binesh



Nicholas Leonard

Feb 29, 2016, 2:15:05 PM
to torch7
I also have the same problem. Keep us posted on NVIDIA's resolution. Thanks!



Nicholas Leonard

Feb 29, 2016, 2:57:01 PM
to torch7
Hi Binesh. I ran your script and reinstalled CUDA 7.5. I can install and require torch/nn/cutorch/cunn. But as soon as I try to exit the torch interpreter after requiring cutorch, it hangs, and then I can't reboot, can't require cutorch from another terminal, and so on. Did you find a workaround for this?

Binesh Bannerjee

Feb 29, 2016, 3:02:57 PM
to torch7 on behalf of Nicholas Leonard
Nicholas, what does it say if you do uname -a? 

I get: Linux gpu 3.19.0-33-generic #38~14.04.1-Ubuntu SMP Fri Nov 6 18:17:28 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

Basically, any kernel above 3.19.0-33 has this issue. Are you using Ubuntu?

Binesh

Nicholas Leonard

Feb 29, 2016, 3:28:40 PM
to torch7
Linux rhea 3.19.0-51-generic #58~14.04.1-Ubuntu SMP Fri Feb 26 22:02:58 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux



Binesh Bannerjee

Feb 29, 2016, 5:59:45 PM
to torch7
Nicholas,
It's because in my script I have

    for i in 41 42 43 47

which only removes kernel versions 3.19.0-41, 3.19.0-42, 3.19.0-43 and 3.19.0-47. I did this on purpose, so that I could _try_ newer kernels as they came out. So, you have to modify my script to also delete 3.19.0-51... Just change that line to

    for i in 41 42 43 47 51

and it should work again... You'll have to hard power down to reboot, though, unfortunately...

BTW, the script is by _no means_ a fix; it is _simply_ a way to keep the older kernel around until NVIDIA gives us a new version of CUDA that doesn't have this problem.

Binesh

Vasili Ramanishka

Feb 29, 2016, 8:37:43 PM
to torch7
Nicholas, could you check this for your case? Just prefix the th invocation with the variable:

    CUDA_VISIBLE_DEVICES=0 th

Binesh Bannerjee

Feb 29, 2016, 9:03:47 PM
to torch7 on behalf of Vasili Ramanishka
Vasili,

Really interesting. So, I did this, and sure enough, it works fine. So, I decided to run some tests.

export CUDA_VISIBLE_DEVICES=0 works
export CUDA_VISIBLE_DEVICES=1 works
export CUDA_VISIBLE_DEVICES=2 works

So, then I decided to run pairs:

export CUDA_VISIBLE_DEVICES=0,1 works

and

export CUDA_VISIBLE_DEVICES=0,2 works

BUT,

export CUDA_VISIBLE_DEVICES=1,2

causes the all-too-familiar hang on exit... Now, when I was tearing out the cards, I was unable to put in _only_ the two Tesla K40c's, because the motherboard seemed to _need_ my Quadro M6000. On my box, 0 is the M6000, and 1 and 2 are the two Tesla K40c's... So this narrows the problem down a bit more: it looks like it comes from having two Teslas in the same box. Hopefully all of this is merely "academic", since NVIDIA says they have a fix coming in the next release... We shall see... But, thanks!

Binesh



Nicholas Leonard

Mar 1, 2016, 11:12:47 AM
to torch7 on behalf of Binesh Bannerjee
Hi Binesh,

I updated your script to delete my kernel version's packages. But even after I run it and reboot, the kernel version is still -51; any idea how to fix that?

Regards,

Nicholas Léonard
917-741-7570

Jérémy Morvan

Mar 3, 2016, 11:08:48 AM
to torch7
Hi all,

I have the same issue here on several multi-GPU systems. My workaround is to always have an 'nvidia-smi' process running, which strangely avoids the hang when requiring cutorch. I guess I will wait for the official release as well.

George Stoyanov

Mar 20, 2016, 2:41:21 PM
to torch7
I have the same issues with one Tesla K10 GPU. The workaround that worked for me was using:
CUDA_VISIBLE_DEVICES=0 th - to run the first GPU chip
CUDA_VISIBLE_DEVICES=1 th - to run the second GPU chip

This way I can use both chips in two different processes.

Binesh Bannerjee

Mar 29, 2016, 2:09:42 AM
to torch7
Hi everyone!

Possibly good news! I got this from NVIDIA:


Hi Binesh,

Linux driver version 361.28 should contain the fix of this issue already, please verify this problem again with v361.28 driver and let us know if it works. Thanks.
Download link for v361.28 driver: http://www.nvidia.com/download/driverResults.aspx/98373/en-us

My tests suggest it works... I installed kernel 3.19.0-56-generic, ran "require 'cutorch'" and exited, and that worked fine... Right now I'm running something else, and it's still going... But maybe you guys could test it out as well...

Binesh

Karthik Sarma

Apr 1, 2016, 4:57:09 PM
to torch7
We were having the same issue on our 4x Titan X machine. Thanks to finding this thread, I was able to fix it with the 361 driver from the graphics-drivers PPA (https://launchpad.net/~graphics-drivers/+archive/ubuntu/ppa) on kernel 3.19.0-56-generic.

Thanks so much, Binesh, for following up with the solution! Hopefully we won't see too many problems from using such a fresh driver.

Karthik

Tatsunosuke Shimada

Apr 8, 2016, 10:57:50 PM
to torch7
Has anyone managed to get it working for the Tesla series, such as the K80?
I tried the latest driver version, 352.79, but it still didn't solve the problem...
Maybe the problem is due to the kernel version??
My installed kernel should be 3.19.0-58-generic or something...



Shimada

Binesh Bannerjee

Apr 9, 2016, 4:59:54 AM
to torch7
Shimada, 352.79 is not the latest driver... Try the 361.28 driver: http://www.nvidia.com/download/driverResults.aspx/98373/en-us

It works for me.

Binesh

sp

Apr 9, 2016, 10:03:46 PM
to torch7
I have the same problem with a Tesla K40c; however, the 361.28 driver doesn't support the K40c. Does someone have a solution for the Tesla series?


sp

Apr 9, 2016, 10:05:03 PM
to torch7
I am having the same problem with a Tesla K40c. Would you mind letting me know what these mean?

    CUDA_VISIBLE_DEVICES=0 th - to run the first GPU chip
    CUDA_VISIBLE_DEVICES=1 th - to run the second GPU chip


Binesh Bannerjee

Apr 9, 2016, 10:08:41 PM
to torch7
sp,
I have the Tesla K40C, and I'm using 361.28... See my attached shell output...

Binesh
shell.txt

Sangpil Kim

Apr 10, 2016, 1:05:17 AM
to torch7 on behalf of Binesh Bannerjee
Thank you for your advice. So I should just update the driver with this: http://www.nvidia.com/download/driverResults.aspx/98373/en-us ?
There is no mention of the K40c among the supported devices, so I thought it was only for GeForce and other graphics cards. It seems it will work for the K40c as well.
I will try updating with that linked driver and let you know.

Sincerely, 
Sangpil Kim


Tatsunosuke Shimada

Apr 10, 2016, 11:11:34 PM
to torch7
Thanks Binesh!
It worked!!!!!!!!

My env was Ubuntu with K80.



Nicholas Leonard

Apr 11, 2016, 1:59:15 PM
to torch7 on behalf of Tatsunosuke Shimada
Hi guys,

the new driver fixes it. I was able to make it work with Ubuntu 14.04.4 LTS (GNU/Linux 3.19.0-51-generic x86_64). I don't think the exact kernel version matters, as others have made it work with -56 instead of -51.

The first thing I did was uninstall the nvidia packages:

sudo apt-get remove --purge nvidia-*

After that I downloaded the CUDA .run installer (https://developer.nvidia.com/cuda-downloads):

wget http://developer.download.nvidia.com/compute/cuda/7.5/Prod/local_installers/cuda_7.5.18_linux.run

And then I installed everything except the samples and the driver:

chmod o+x cuda_7.5.18_linux.run
sudo ./cuda_7.5.18_linux.run

wget http://us.download.nvidia.com/XFree86/Linux-x86_64/361.28/NVIDIA-Linux-x86_64-361.28.run
chmod o+x NVIDIA-Linux-x86_64-361.28.run
sudo ./NVIDIA-Linux-x86_64-361.28.run

I did not select the option to include the 32-bit libraries, as it generated an error (file conflict).

Note that I also needed to stop the X server for the above to work:

sudo service lightdm stop
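
After a reboot, a quick sanity check could look like this (the -e flag is trepl's; adjust to taste):

    nvidia-smi | head -n 3       # the header should now report driver version 361.28
    th -e 'require "cutorch"'    # should exit cleanly instead of hanging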

Hope this helps,


Nicholas Léonard
917-741-7570



Sangpil Kim

Apr 11, 2016, 3:26:07 PM
to torch7 on behalf of Nicholas Leonard
Dear Nicholas,

Thank you for your advice. Would you mind letting me know which GPU model you are using?

Sincerely, 
Sangpil Kim

Nicholas Leonard

Apr 15, 2016, 4:36:47 PM
to torch7 on behalf of sp
Titan Black and X

Nicholas Léonard
917-741-7570
