Hi,
I'm trying to run the WSJ example scripts for nnet2 and everything runs fine until it gets to steps/nnet2/train_multisplice_accel2.sh.
It gets as far as "Training neural net (pass 0)" and never gets any further.
Looking at the logs I can see:
WARNING (nnet-train-simple:SelectGpuId():cu-device.cc:137) Will try again to get a GPU after 20 seconds.
Wed Mar 1 15:31:39 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 0000:02:00.0      On |                  N/A |
| 27%   39C    P8     9W / 180W |    562MiB /  8104MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1203    G   /usr/lib/xorg/Xorg                             311MiB |
|    0      1396    C   nnet-train-simple                              139MiB |
|    0      4417    G   compiz                                         108MiB |
+-----------------------------------------------------------------------------+
LOG (nnet-train-simple:SelectGpuId():cu-device.cc:146) num-gpus=1. Device 0: all CUDA-capable devices are busy or unavailable.
ERROR (nnet-train-simple:SelectGpuId():cu-device.cc:147) Failed to create CUDA context, no more unused GPUs?
[ Stack-Trace: ]
nnet-train-simple() [0x9b7620]
kaldi::MessageLogger::HandleMessage(kaldi::LogMessageEnvelope const&, char const*)
kaldi::MessageLogger::~MessageLogger()
kaldi::CuDevice::SelectGpuId(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)
main
__libc_start_main
_start
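The nvidia-smi output in the log shows compute mode "E. Process" (Exclusive_Process) and an nnet-train-simple process (PID 1396) already listed on GPU 0. A minimal sketch for re-checking that state from the shell, assuming nvidia-smi is on the PATH; the helper name is_exclusive_mode is made up for illustration:

```shell
# Hypothetical helper: succeeds if nvidia-smi CSV text on stdin
# reports an exclusive compute mode (e.g. Exclusive_Process).
is_exclusive_mode() {
  grep -q 'Exclusive'
}

# Only query the driver if nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  # Compute mode per GPU (Default, Exclusive_Process, Prohibited).
  nvidia-smi --query-gpu=index,name,compute_mode --format=csv || true
  # Compute processes that currently hold a context on a GPU.
  nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv || true
fi
```

The helper can be fed the first query, for example `nvidia-smi --query-gpu=compute_mode --format=csv | is_exclusive_mode && echo "exclusive mode"`.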
When I qstat the job, I see:
Scheduling info: (-l arch=*64*,gpu=1) cannot run at host "<hostName>" because it offers only hc:gpu=0.000000
Using qconf I can see that gpu and mem_free are both set as per the instructions found here:
http://kaldi-asr.org/doc/queue.html
I ran the tests in cudamatrix and they all report back as "SUCCESS".
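For reference, the gpu and mem_free settings mentioned above can be inspected with qconf roughly as follows. This is a sketch that assumes a standard GridEngine install and that the execution host is registered under the name printed by `hostname`; the helper name gpu_complex_value is made up for illustration:

```shell
# Hypothetical helper: print the first gpu=<n> entry found in
# `qconf -se <host>` output read from stdin.
gpu_complex_value() {
  grep -o 'gpu=[0-9]*' | head -n 1
}

# Only run the live checks if GridEngine's qconf is available.
if command -v qconf >/dev/null 2>&1; then
  # Complex definitions: expect entries named gpu and mem_free.
  qconf -sc | grep -E '^(gpu|mem_free)[[:space:]]' || true
  # Execution host config: complex_values should include gpu=<n>.
  qconf -se "$(hostname)" | gpu_complex_value || true
fi
```

If the second command prints nothing or gpu=0, the host is not advertising a GPU to the scheduler, which would match the "offers only hc:gpu=0.000000" message from qstat.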
I've tried changing the compute value in kaldi.mk and recompiling, but this did not seem to make any difference.
I also tried running sudo chmod a+rwx /dev/nvidia* but I still see the same problems in the logs.
How can I solve this problem?