CURAND_STATUS_LAUNCH_FAILURE using new Titan X Pascal

David Cofer

Nov 12, 2016, 7:47:40 AM
to DIGITS Users
I have been using DIGITS for a while now with a GeForce GTX 960. I wanted more power, so I shelled out a bunch of money for a new Titan X Pascal GPU. I installed it in the secondary PCIe slot so I could use the GTX primarily for graphics and the Titan for compute. I can run deviceQuery and see both GPUs. I have also run the bandwidth test on the new Titan and it passed, so the GPU is working. I am currently using DIGITS 4.1-dev and Caffe 0.15.9, and I was able to figure out how to change the DIGITS config file so I could use both cards. However, when I clone any of my previous jobs that ran fine on the GTX and try to run them on the Titan, I get a CUDA error.

I1112 06:19:01.239487  5120 solver.cpp:362] Iteration 0, Testing net (#0)
I1112 06:19:04.057446  5120 blocking_queue.cpp:50] Data layer prefetch queue empty
I1112 06:23:06.287995  5120 solver.cpp:429]     Test net output #0: accuracy = 0.0070898
I1112 06:23:06.288074  5120 solver.cpp:429]     Test net output #1: loss = 3.04448 (* 1 = 3.04448 loss)
F1112 06:23:06.447123  5120 math_functions.cu:396] Check failed: status == CURAND_STATUS_SUCCESS (201 vs. 0)  CURAND_STATUS_LAUNCH_FAILURE
*** Check failure stack trace: ***
@     0x7fa1f5014daa  (unknown)
@     0x7fa1f5014ce4  (unknown)
@     0x7fa1f50146e6  (unknown)
@     0x7fa1f5017687  (unknown)
@     0x7fa1f5737cd4  (unknown)
@     0x7fa1f5767f55  (unknown)
@     0x7fa1f56d87d8  (unknown)
@     0x7fa1f56d8b57  (unknown)
@     0x7fa1f57116fc  (unknown)
@     0x7fa1f5711fce  (unknown)
@           0x40af36  (unknown)
@           0x40867c  (unknown)
@     0x7fa1f3b17f45  (unknown)
@           0x408e4d  (unknown)
@              (nil)  (unknown)

This is a very simple test network I am using just to see if things are running correctly. All jobs I have tried to run on the Titan eventually fail this way after churning for a while, but if I clone the job and run it again on the GTX it runs perfectly fine. Does anyone have any idea why this is not running on my new and very expensive GPU, but runs fine on the older, cheaper one? Does anyone have any suggestions on how I can get more information about why it is failing? I have attached the Caffe log.

Thanks,
David


caffe_output (4).log
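
For anyone reading the log above: the fatal line is a glog CHECK failure, and the number in "(201 vs. 0)" is the raw cuRAND status code. Below is a minimal Python sketch for pulling the source location and status name out of such a line. The helper name decode_check_failure and the regular expression are hypothetical illustrations, not part of Caffe or DIGITS; the code-to-name table follows the curandStatus_t enum in curand.h. It only decodes the message and does not explain why the launch failed.

```python
import re

# Status values from the curandStatus_t enum in curand.h.
CURAND_STATUS = {
    0: "CURAND_STATUS_SUCCESS",
    100: "CURAND_STATUS_VERSION_MISMATCH",
    101: "CURAND_STATUS_NOT_INITIALIZED",
    102: "CURAND_STATUS_ALLOCATION_FAILED",
    103: "CURAND_STATUS_TYPE_ERROR",
    104: "CURAND_STATUS_OUT_OF_RANGE",
    105: "CURAND_STATUS_LENGTH_NOT_MULTIPLE",
    106: "CURAND_STATUS_DOUBLE_PRECISION_REQUIRED",
    201: "CURAND_STATUS_LAUNCH_FAILURE",
    202: "CURAND_STATUS_PREEXISTING_FAILURE",
    203: "CURAND_STATUS_INITIALIZATION_FAILED",
    204: "CURAND_STATUS_ARCH_MISMATCH",
    999: "CURAND_STATUS_INTERNAL_ERROR",
}

# Hypothetical pattern for a glog fatal line such as:
# F1112 06:23:06.447123  5120 math_functions.cu:396] Check failed: ... (201 vs. 0)
CHECK_RE = re.compile(
    r"^F\d{4} [\d:.]+\s+\d+ (?P<src>[\w.]+):(?P<line>\d+)\] "
    r"Check failed: (?P<expr>.+?) \((?P<got>-?\d+) vs\. (?P<want>-?\d+)\)"
)


def decode_check_failure(log_line):
    """Return source file, line number, and cuRAND status name, or None."""
    match = CHECK_RE.match(log_line)
    if match is None:
        return None
    code = int(match.group("got"))
    return {
        "source": match.group("src"),
        "line": int(match.group("line")),
        "code": code,
        "status": CURAND_STATUS.get(code, "unknown status"),
    }
```

Passing the F-prefixed line from the log to decode_check_failure gives the file (math_functions.cu), the line (396), and the status name for code 201.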

David Cofer

Nov 12, 2016, 7:57:06 AM
to DIGITS Users
After some more research, I believe this may be happening because I still have CUDA 7.5 installed. I am going to try to upgrade to 8.0 and see if that fixes my issue.

David Cofer

Nov 13, 2016, 7:46:25 AM
to DIGITS Users
So that did not work. I had to upgrade pretty much everything. I switched over to CUDA 8 and upgraded to cuDNN 5.1, and then I had to rebuild the latest OpenCV with CUDA 8. I pulled the latest nvidia/caffe, rebuilt it, and installed it. Since DIGITS 5.0 was just released with semantic segmentation, I pulled that from GitHub and used it instead of the 4.1-dev I was using. However, I have the exact same problem. I can run semantic segmentation tasks on DIGITS 5.0 just fine with my older GTX card, but when I clone that job and try to run it on the new Titan X, it fails with "Check failed: status == CURAND_STATUS_SUCCESS (201 vs. 0)  CURAND_STATUS_LAUNCH_FAILURE". I am stumped on this and I hope someone might have some suggestions or other things to try.

Andrew Janowczyk

Nov 17, 2016, 3:57:54 PM
to DIGITS Users
Hi David,

Any luck? I just installed the same GPU and have the same issue.

The machine has 2 other GPUs, a Tesla K20 and an older non-Pascal GeForce Titan X, which are still working fine, but the Titan Pascal sits idly by.

I went into the /opt/caffe-nv directory and ran "make runtest", and it also hangs when using the new Titan Pascal (see attached).

I also tried with the official Caffe branch, and got the same issue.

Likely not a digits or caffe problem, but a driver or hardware issue?

Cheers,
Andrew
titanx-pascal.png

Greg Heinrich

Nov 17, 2016, 4:29:09 PM
to Andrew Janowczyk, DIGITS Users
I notice that people who have this issue have several GPUs from different generations. Is it worth trying to make only the Pascal GPUs "visible" as a quick experiment? You can do this by setting the CUDA_VISIBLE_DEVICES environment variable.
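
As a rough illustration of that experiment, here is a minimal Python sketch that launches a command with only selected GPUs visible. The function names env_with_visible_devices and run_on_devices are hypothetical. CUDA_VISIBLE_DEVICES and CUDA_DEVICE_ORDER are standard CUDA environment variables; setting CUDA_DEVICE_ORDER to PCI_BUS_ID is an added assumption here so that the indices match the order nvidia-smi reports, since CUDA's default ordering puts the fastest device first.

```python
import os
import subprocess


def env_with_visible_devices(device_ids, base_env=None):
    """Copy an environment and restrict CUDA to the given device indices."""
    env = dict(os.environ if base_env is None else base_env)
    # Enumerate devices in PCI bus order so indices match nvidia-smi.
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    # Comma-separated list of the only devices the child process may see.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in device_ids)
    return env


def run_on_devices(command, device_ids):
    """Run a command (list of arguments) with restricted GPU visibility.

    Returns the exit code of the child process.
    """
    env = env_with_visible_devices(device_ids)
    return subprocess.run(command, env=env, check=False).returncode
```

For example, run_on_devices(["make", "runtest"], [1]) would run the Caffe test suite with only device 1 visible. Which index the Pascal card has on a given machine is an assumption to check against nvidia-smi first.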

Andrew Janowczyk

Nov 17, 2016, 4:50:43 PM
to Greg Heinrich, DIGITS Users
Yeah, I tried that, as well as setting "nvidia-smi -c 2" (compute prohibited) for the other 2 cards, but no dice :-\

What's odd is that when running the tests you can occasionally see the GPU usage deviate from 0%, but that particular test runs very slowly, and then a few tests later it crashes again. Of course the tests are randomly shuffled, but from what I can tell from a few iterations, the test that crashes is random. The suite has never run to completion.

I'm going to attempt to re-install the drivers/sdk and recompile everything and we'll see if that works. It'll take a few hours so I was hoping someone had stumbled across a more elegant solution...c'est la vie


Andrew Janowczyk

Nov 18, 2016, 8:59:02 AM
to DIGITS Users, gregory....@gmail.com

In the end, I re-downloaded and installed CUDA 8.0 and cuDNN, then recompiled NCCL and Caffe, and things are working beauuuuuutifully.

Cheers,
Andrew