--
You received this message because you are subscribed to the Google Groups "torch7" group.
To unsubscribe from this group and stop receiving emails from it, send an email to torch7+un...@googlegroups.com.
To post to this group, send email to tor...@googlegroups.com.
Visit this group at https://groups.google.com/group/torch7.
For more options, visit https://groups.google.com/d/optout.
Binesh, this seems like an annoying driver bug of some sort. Did you run the multi-GPU CUDA examples that are shipped with CUDA? Do they have the same issue, or is it only an issue with cutorch...?
That's some painful investigation :)
Glad it's over.
Hi Binesh,
We have finally reproduced this issue on another test system, a TYAN-B7079 platform configured with a Quadro M6000 + 2 Tesla K80 GPUs.
[The observed results are attached below for your reference.]
And now, we have assigned this issue to the appropriate developer team for further investigation. We’ll keep you posted on the bug report once we have a fix. Thanks.
Sorry for any inconvenience caused by this problem.
====
[ 2065.284696] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
[ 2065.284762] IP: [<ffffffffc101452d>] _nv002814rm+0x51d/0x610 [nvidia]
[ 2065.284763] PGD 0
[ 2065.284765] Oops: 0000 [#1] SMP
[ 2065.284797] Modules linked in: nvidia_uvm(POE) nvidia(POE) arc4 md4 nls_utf8 cifs fscache rfcomm bnep bluetooth snd_hda_codec_hdmi intel_rapl snd_hda_intel snd_hda_controller iosf_mbi snd_hda_codec ast joydev ttm snd_hwdep x86_pkg_temp_thermal drm_kms_helper intel_powerclamp drm snd_pcm snd_seq_midi coretemp kvm_intel snd_seq_midi_event syscopyarea kvm snd_rawmidi sysfillrect snd_seq sysimgblt snd_seq_device snd_timer crct10dif_pclmul snd crc32_pclmul soundcore aesni_intelaes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd shpchp mei_me ipmi_ssifsb_edac mei edac_core acpi_power_meter lpc_ich 8250_fintek ipmi_si wmi acpi_pad ipmi_msghandler mac_hid parport_pc ppdev lp parport hid_generic igb i2c_algo_bit dca usbhid ptp hid pata_acpi pps_core [last unloaded: nvidia]
[ 2065.284801] CPU: 28 PID: 9672 Comm: luajit Tainted: POE 3.19.0-42 -generic #48~14.04.1-Ubuntu
[ 2065.284801] Hardware name: empty empty/FT77C-B7079, BIOS V1.03.B10 05/21/2015
[ 2065.284802] task: ffff880068a73ae0 ti: ffff8806ec6c8000 task.ti: ffff8806ec6c8000
[ 2065.284844] RIP: 0010:[<ffffffffc101452d>] [<ffffffffc101452d>] _nv002814rm+0x51d/0x610 [nvidia]
[ 2065.284845] RSP: 0018:ffff8806ec6cba58 EFLAGS: 00010246
[ 2065.284846] RAX: 0000000000000000 RBX: ffff88078ceac008 RCX: 0000000000000000
[ 2065.284846] RDX: 0000000000000000 RSI: 0000000000000011 RDI: 0000000000000000
[ 2065.284847] RBP: ffff8807aff7af58 R08: ffff880783051a30 R09: ffff8810583d6700
[ 2065.284847] R10: 00000000568240a4 R11: ffffffffc1430db0 R12: 0000000000000000
[ 2065.284848] R13: 0000000000000001 R14: 0000000000000001 R15: ffff880858937c08
[ 2065.284849] FS: 00007f1f2bbff700(0000) GS:ffff88085fbc0000(0000) knlGS:0000000000000000
[ 2065.284849] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2065.284850] CR2: 0000000000000020 CR3: 0000000001c16000 CR4: 00000000001407e0
[ 2065.284851] Stack:
[ 2065.284852] ffff88078ceac008 ffff88078328d008 ffff88085b5df808 0000000000000000
[ 2065.284853] ffff88078460c008 ffffffffc100e540 ffff88078328d008 ffff88078ceac008
[ 2065.284855] 0000000000001100 0000000000000000 0000000000000024 ffffffffc1005291
[ 2065.284855] Call Trace:
[ 2065.284899] [<ffffffffc100e540>] ? _nv003016rm+0xf0/0x1c0 [nvidia]
[ 2065.284938] [<ffffffffc1005291>] ? _nv003007rm+0x11/0x50 [nvidia]
[ 2065.285029] [<ffffffffc122a090>] ? _nv002020rm+0x2680/0x3c80 [nvidia]
[ 2065.285071] [<ffffffffc147b869>] ? _nv000654rm+0x2b9/0x340 [nvidia]
[ 2065.285111] [<ffffffffc1471e3a>] ? rm_disable_adapter+0x6a/0x130 [nvidia]
[ 2065.285152] [<ffffffffc148dff6>] ? nv_uvm_notify_stop_device+0x46/0x70 [nvidia]
[ 2065.285193] [<ffffffffc1482326>] ? nvidia_close+0x1a6/0x410 [nvidia]
[ 2065.285234] [<ffffffffc147fc7d>] ? nvidia_frontend_close+0x4d/0xa0 [nvidia]
[ 2065.285238] [<ffffffff811ee117>] ? __fput+0xe7/0x220
[ 2065.285240] [<ffffffff811ee29e>] ? ____fput+0xe/0x10
[ 2065.285243] [<ffffffff81091eac>] ? task_work_run+0xac/0xd0
[ 2065.285245] [<ffffffff810773f0>] ? do_exit+0x2c0/0xb00
[ 2065.285247] [<ffffffff81080ebf>] ? recalc_sigpending+0x1f/0x60
[ 2065.285248] [<ffffffff81077cbf>] ? do_group_exit+0x3f/0xa0
[ 2065.285250] [<ffffffff81083c30>] ? get_signal+0x1e0/0x710
[ 2065.285253] [<ffffffff81014e70>] ? do_signal+0x20/0x120
[ 2065.285294] [<ffffffffc147fcff>] ? nvidia_frontend_ioctl+0x2f/0x70 [nvidia]
[ 2065.285296] [<ffffffff811ffd28>] ? do_vfs_ioctl+0x2f8/0x510
[ 2065.285299] [<ffffffff810f10c1>] ? SyS_futex+0x71/0x150
[ 2065.285301] [<ffffffff81014fd9>] ? do_notify_resume+0x69/0xb0
[ 2065.285305] [<ffffffff817b79af>] ? int_signal+0x12/0x17
[ 2065.285317] Code: 16 75 00 31 c9 44 89 f2 be 2c 00 00 00 48 89 c7 ff 50 20 48 85 c0 49 89 c4 0f 84 d6 00 00 00 31 c9 31 d2 be 11 00 00 00 4c 89 e7 <41> ff 54 24 20 be 30 00 00 00 48 8b b8 a8 05 00 00 48 89 c3 ff
[ 2065.285358] RIP [<ffffffffc101452d>] _nv002814rm+0x51d/0x610 [nvidia]
[ 2065.285359] RSP <ffff8806ec6cba58>
[ 2065.285360] CR2: 0000000000000020
[ 2065.292853] ---[ end trace ff3f27fe0d1c23e2 ]---
[ 2065.292854] Fixing recursive fault but reboot is needed!
===
Thanks,
Kevin
#!/bin/bash
# Update the system and install the lts-vivid kernel meta packages.
/usr/bin/apt-get update
/usr/bin/apt-get dist-upgrade
/usr/bin/apt-get install \
    "linux-generic-lts-vivid" \
    "linux-headers-generic-lts-vivid" \
    "linux-image-generic-lts-vivid"

# Collect the packages of the specific 3.19.0 kernel versions to remove.
REMOVETHESE=""
for i in 41 42 43 47
do
    REMOVETHESE="${REMOVETHESE} linux-headers-3.19.0-${i} linux-headers-3.19.0-${i}-generic linux-image-3.19.0-${i}-generic linux-image-extra-3.19.0-${i}-generic"
done

# Left unquoted on purpose so the list splits into separate package names.
/usr/bin/apt-get purge ${REMOVETHESE}
/usr/bin/apt-file update
/usr/bin/apt-get autoclean
/usr/bin/apt-get autoremove
/etc/update-motd.d/98-reboot-required
/bin/date
linux-generic-lts-vivid package?
Yeah, if you remove that (the meta package), and install 3.19.0-33-generic, then it'll stay at 33... But I wanted to be able to keep testing to see if any new kernels would fix the problem. So I was updating and checking whenever I saw a new kernel release, and if it didn't fix the problem, I was just removing that _specific_ version (which, if it's the latest, also removes the meta package...)
But if you just remove the meta package, you're right, you don't have to do all of what I wrote above.
Binesh
But, thanks Vasili, I'll give that a shot... I'm tired of being stuck at the same linux version all this time.
I should have posted this before, but I got this from someone on my bug report:
Last comment from NVIDIA (Ryan L. - 02/14/2016 1:37 AM):
Hi Binesh,
The bug is fixed in our development driver. We'll inform you once the driver is officially released.
Thanks,
Ryan
Of course, it's been two weeks, and I still don't see a new cuda library... So...
Binesh
Nicholas, what does it say if you do uname -a?
I get:
Linux gpu 3.19.0-33-generic #38~14.04.1-Ubuntu SMP Fri Nov 6 18:17:28 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
Basically, any linux above 3.19.0-33 has this issue. Are you using ubuntu?
Binesh
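Binesh's rule of thumb above (anything newer than 3.19.0-33 has the issue) can be checked with a few lines of shell. This is only a sketch: `kernel_is_affected` is a name made up for illustration, it assumes a 3.19.0-N release string as printed by `uname -r`, and it says nothing about other kernel series.

```shell
#!/bin/bash
# Sketch only: kernel_is_affected is a hypothetical helper, not part of any tool.
# It applies the rule from this thread: 3.19.0-N kernels with N > 33 hit the bug.
# Returns 0 if affected, 1 if not, 2 if the rule does not apply to this string.
kernel_is_affected() {
    local release="$1"              # e.g. "3.19.0-42-generic"
    local abi
    case "$release" in
        3.19.0-*) ;;
        *) return 2 ;;              # not a 3.19.0 kernel
    esac
    abi="${release#3.19.0-}"        # "42-generic"
    abi="${abi%%-*}"                # "42"
    case "$abi" in
        ''|*[!0-9]*) return 2 ;;    # unexpected format
    esac
    [ "$abi" -gt 33 ]
}

if kernel_is_affected "$(uname -r)"; then
    echo "kernel $(uname -r) is newer than 3.19.0-33 and is likely affected"
else
    echo "kernel $(uname -r) is 3.19.0-33 or older, or not covered by this rule"
fi
```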
for i in 41 42 43 47
which is only removing linux versions 3.19.0-41, 3.19.0-42, 3.19.0-43 and 3.19.0-47. I did this on purpose, so that it would allow me to _try_ newer versions of linux as they came out. So, you have to modify my script to also delete 3.19.0-51... So, just change that line to
for i in 41 42 43 47 51
and it should work again... You'll have to hard power down tho, to reboot unfortunately...
BTW, the script is by _no means_ a fix, it is _simply_ a way to keep the older version of linux around till nvidia gives us a new version of CUDA that doesn't have this problem.
Binesh
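Since the version list has to be edited every time a new kernel shows up, the loop can be pulled into a small helper that takes the version numbers as arguments. A sketch, assuming the same four package names per version as in the script earlier in this thread; `purge_list` is a name invented here, and the block only prints the apt-get command for review rather than running it.

```shell
#!/bin/bash
# Sketch only: purge_list is a hypothetical helper built from the loop in the
# script earlier in this thread. It prints package names and removes nothing.
purge_list() {
    local i list=""
    for i in "$@"
    do
        list="${list} linux-headers-3.19.0-${i} linux-headers-3.19.0-${i}-generic linux-image-3.19.0-${i}-generic linux-image-extra-3.19.0-${i}-generic"
    done
    # Drop the leading space left by the first append.
    echo "${list# }"
}

# Print the command instead of running it.
echo "/usr/bin/apt-get purge $(purge_list 41 42 43 47 51)"
```

Adding a newly released version then means adding one number to the call, not editing the loop itself.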
Nicholas, could you check this for your case? Just prepend the variable when you launch the th interpreter:
$ CUDA_VISIBLE_DEVICES=0 th
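For anyone unfamiliar with the syntax: putting the assignment in front of the command sets the variable for that one process only, and CUDA then exposes only the listed device(s) to it. A small sketch; `sh -c` stands in for `th` here so the snippet also runs on a machine without Torch installed.

```shell
#!/bin/bash
# The assignment prefix applies to this one command only.
# sh -c is a stand-in for `th` so the effect is visible without Torch.
CUDA_VISIBLE_DEVICES=0 sh -c 'echo "child sees: CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"'

# The calling shell is left untouched (shows "<unset>" unless it was set before).
echo "parent sees: CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-<unset>}"

# Real use, as suggested above (not run here):
#   CUDA_VISIBLE_DEVICES=0 th
```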
Hi Binesh,
Linux driver version 361.28 should already contain the fix for this issue. Please verify this problem again with the v361.28 driver and let us know if it works. Thanks.
Download link for v361.28 driver: http://www.nvidia.com/download/driverResults.aspx/98373/en-us
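To confirm that a machine is already on a driver with the fix, the installed version can be compared against 361.28. A sketch under a few assumptions: `driver_has_fix` is a made-up helper name, it relies on GNU `sort -V` for version ordering, and it treats every version at or above 361.28 as fixed, which goes slightly beyond what NVIDIA states above.

```shell
#!/bin/bash
# Sketch only: driver_has_fix is a hypothetical helper.
# Assumption: every driver at or above 361.28 carries the fix.
driver_has_fix() {
    local installed="$1" fixed="361.28" lowest
    # sort -V (GNU coreutils) orders version strings; the first line is the lowest.
    lowest="$(printf '%s\n%s\n' "$fixed" "$installed" | sort -V | head -n 1)"
    [ "$lowest" = "$fixed" ]
}

# nvidia-smi can report the installed driver version; skip quietly if it is missing.
if command -v nvidia-smi >/dev/null 2>&1; then
    version="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
    if driver_has_fix "$version"; then
        echo "driver $version is 361.28 or newer"
    else
        echo "driver $version is older than 361.28"
    fi
fi
```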
Shimada, 352.79 is not the latest driver. Try the 361.28 driver: http://www.nvidia.com/download/driverResults.aspx/98373/en-us
Thank you for your advice. So I should simply update the driver with this one: http://www.nvidia.com/download/driverResults.aspx/98373/en-us ? The supported devices do not mention the K40C, so I thought it was only for GeForce and other graphics cards. It seems it will work for the K40C as well. I will try updating with the linked driver and let you know.
Sincerely,
Sangpil Kim
Thanks Binesh! It worked!!!!!!!! My env was Ubuntu with K80.
On Sunday, April 10, 2016 at 2:05:17 PM UTC+9, sp wrote:
Thank you for your advice. So I should simply update the driver with this one: http://www.nvidia.com/download/driverResults.aspx/98373/en-us ? The supported devices do not mention the K40C, so I thought it was only for GeForce and other graphics cards. It seems it will work for the K40C as well. I will try updating with the linked driver and let you know.
Sincerely,
Sangpil Kim