Failed to create CUDA context when using GTX 1080

1,967 views

PhilMC

unread,
Mar 8, 2017, 10:09:41 AM3/8/17
to kaldi-help
Hi,

I'm trying to run the WSJ example scripts for nnet2 and everything runs fine until it gets to steps/nnet2/train_multisplice_accel2.sh.
It gets as far as Training neural net (pass 0) and never gets any further.

Looking at the logs I can see:

WARNING (nnet-train-simple:SelectGpuId():cu-device.cc:137) Will try again to get a GPU after 20 seconds.
Wed Mar  1 15:31:39 2017      
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 0000:02:00.0      On |                  N/A |
| 27%   39C    P8     9W / 180W |    562MiB /  8104MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
                                                                              
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1203    G   /usr/lib/xorg/Xorg                             311MiB |
|    0      1396    C   nnet-train-simple                              139MiB |
|    0      4417    G   compiz                                         108MiB |
+-----------------------------------------------------------------------------+
LOG (nnet-train-simple:SelectGpuId():cu-device.cc:146) num-gpus=1. Device 0: all CUDA-capable devices are busy or unavailable. 
ERROR (nnet-train-simple:SelectGpuId():cu-device.cc:147) Failed to create CUDA context, no more unused GPUs?

[ Stack-Trace: ]
nnet-train-simple() [0x9b7620]
kaldi::MessageLogger::HandleMessage(kaldi::LogMessageEnvelope const&, char const*)
kaldi::MessageLogger::~MessageLogger()
kaldi::CuDevice::SelectGpuId(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)
main
__libc_start_main
_start

and when I qstat the job I see:
Scheduling info:            (-l arch=*64*,gpu=1) cannot run at host "<hostName>" because it offers only hc:gpu=0.000000

Using qconf I can see that gpu and mem_free are both set as per the instructions found here: http://kaldi-asr.org/doc/queue.html
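For reference, these are roughly the commands I used to check the GridEngine setup (the host name is a placeholder; qconf options as in the standard GridEngine man pages):

```shell
# Show the complex (resource) definitions; 'gpu' and 'mem_free' should
# both appear here if they were added per the Kaldi queue.html docs.
qconf -sc | grep -E '^(gpu|mem_free)'

# Show the exec-host configuration; complex_values should list the
# number of usable GPUs, e.g. gpu=1 (replace the placeholder host name).
qconf -se <hostName>
```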

I ran the tests in cudamatrix and they all report back as "SUCCESS".
I've tried changing the compute value in kaldi.mk and recompiling but this did not seem to make any difference.

I also tried running sudo chmod a+rwx /dev/nvidia* but I still see the same problems in the logs.


How can I solve this problem?

Daniel Povey

unread,
Mar 8, 2017, 11:38:21 AM3/8/17
to kaldi-help
Possibly it's trying to use more than one GPU (because --num-jobs-initial > 1) and you only have one.  You could change --num-jobs-{initial,final} to 1 in the calling script (it will change results though).
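E.g., the call in the script would look something like this (the option names are as accepted by train_multisplice_accel2.sh; the elided arguments stand for whatever your run.sh already passes):

```shell
# Restrict training to a single GPU job; remaining options are
# whatever the calling script already uses (shown as "..." here).
steps/nnet2/train_multisplice_accel2.sh \
  --num-jobs-initial 1 \
  --num-jobs-final 1 \
  ...
```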
Dan


--
Go to http://kaldi-asr.org/forums.html find out how to join
---
You received this message because you are subscribed to the Google Groups "kaldi-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kaldi-help+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

PhilMC

unread,
Mar 9, 2017, 4:27:24 AM3/9/17
to kaldi-help
Thanks for the quick response.

I've tried setting --num-jobs-initial & --num-jobs-final to 1 and running the run.sh script again. As before, it gets to Training neural net (pass 0) and no further. Looking at the job, it's giving the same message as before. What else could I try?

Daniel Galvez

unread,
Mar 9, 2017, 4:37:51 AM3/9/17
to kaldi-help
I noticed you're in exclusive process compute mode, meaning that only one process can acquire a GPU at a time. I also noticed that you have two other processes apparently using that GPU. That strikes me as odd. I suggest you try disabling exclusive compute mode.
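Something like the following should do it (nvidia-smi's -c option sets the compute mode; 0 means DEFAULT, which allows multiple processes to share the GPU):

```shell
# Check the current compute mode of each GPU.
nvidia-smi --query-gpu=compute_mode --format=csv

# Set the compute mode back to DEFAULT so the display processes and
# nnet-train-simple can use the GPU at the same time.
sudo nvidia-smi -c 0
```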




--
Daniel Galvez

PhilMC

unread,
Mar 9, 2017, 7:19:22 AM3/9/17
to kaldi-help
I ran sudo nvidia-smi -c 0 which resulted in the message:
Compute mode is already set to DEFAULT for GPU 0000:02:00.0.
All done.

I then ran nvidia-smi which shows:
Thu Mar  9 12:04:45 2017      
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26                 Driver Version: 375.26                    |

|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 0000:02:00.0      On |                  N/A |
| 29%   46C    P2    40W / 180W |    657MiB /  8113MiB |      0%      Default |

+-------------------------------+----------------------+----------------------+
                                                                              
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1211    G   /usr/lib/xorg/Xorg                             400MiB |
|    0      1373    C   nnet-train-simple                              139MiB |
|    0      1885    G   compiz                                         113MiB |
+-----------------------------------------------------------------------------+

I checked the job with qstat and it still says "cannot run at host "HostName" because it offers only hc:gpu=0.000000"

Daniel Povey

unread,
Mar 9, 2017, 12:26:57 PM3/9/17
to kaldi-help
So the issue about not being able to grab your GPU was because your desktop graphics (Xorg and compiz) were using the GPU.
But the problem with the 'gpu' resource is a GridEngine configuration issue; see http://kaldi-asr.org/doc/queue.html
(search for 'gpu' in the section 'Configuring GridEngine' for info on that).
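For reference, the GridEngine side of this usually amounts to something like the following (the host name is a placeholder, and the exact complex line should be taken from queue.html; this is only a sketch of the two qconf steps):

```shell
# Add a 'gpu' consumable to the complex list. This opens an editor;
# the line to add is roughly (see queue.html for the exact fields):
#   gpu   gpu   INT   <=   YES   JOB   0   0
qconf -mc

# Declare how many GPUs this host has. This also opens an editor;
# set something like:
#   complex_values   gpu=1
qconf -me <hostName>
```

After this, a job submitted with -l gpu=1 should be schedulable on that host instead of failing with hc:gpu=0.000000.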
