Problem with GPU training nnet

sameerk...@gmail.com

Feb 5, 2016, 11:54:28 AM
to kaldi-help
Hi,

When I run the command:

steps/train_nnet.sh --hid-dim 2048 --hid-layers 5 --learn-rate 0.008 data/train_mer90_tr90 data/train_mer90_cv10 data/lang exp/mer90/tri4_ali exp/mer90/tri4_ali exp/mer90/tri4_dnn_2048x5


I get the following terminal output:


### IS CUDA GPU AVAILABLE? 'sls-tesla-0' ###
WARNING (SelectGpuId():cu-device.cc:183) Suggestion: use 'nvidia-smi -c 1' to set compute exclusive mode
LOG (SelectGpuIdAuto():cu-device.cc:301) Selecting from 3 GPUs
WARNING (DeviceGetName():cu-device.cc:501) cannot open libcuda.so
WARNING (GetFreeMemory():cu-device.cc:456) cannot open libcuda.so
LOG (SelectGpuIdAuto():cu-device.cc:316) cudaSetDevice(0): Unknown GPU  free:0M, used:0M, total:0M, free/total:1
WARNING (DeviceGetName():cu-device.cc:501) cannot open libcuda.so
WARNING (GetFreeMemory():cu-device.cc:456) cannot open libcuda.so
LOG (SelectGpuIdAuto():cu-device.cc:316) cudaSetDevice(1): Unknown GPU  free:0M, used:0M, total:0M, free/total:1


GPU info:

+------------------------------------------------------+
| NVIDIA-SMI 346.96     Driver Version: 346.96         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla C2075         Off  | 0000:06:00.0     Off |                    0 |
| 35%   84C    P0   107W / 225W |    319MiB /  5375MiB |     51%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla C2075         Off  | 0000:08:00.0     Off |                    0 |
| 30%   63C   P12    36W / 225W |     10MiB /  5375MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla C2075         Off  | 0000:82:00.0     Off |                    0 |
| 30%   68C    P0    88W / 225W |    301MiB /  5375MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     23723    C   python                                         302MiB |
|    2     28417    C   python                                         289MiB |
+-----------------------------------------------------------------------------+



OS info:


Linux version 3.13.0-65-generic (buildd@lgw01-55) (gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1) ) #105-Ubuntu SMP Mon Sep 21 18:50:58 UTC 2015


I am not sure what the problem could be. Any suggestions would be great!


More info:

The same code worked fine on another machine that had a different GPU version installed. Because this machine's version differs, I had to recompile and rebuild Kaldi.


Thanks

S

Daniel Povey

Feb 5, 2016, 12:57:42 PM
to kaldi-help
It can't find the library libcuda.so, most likely because it's not on your LD_LIBRARY_PATH.  Do some reading about PATH and LD_LIBRARY_PATH.
Dan
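
For anyone hitting the same "cannot open libcuda.so" warning, here is a minimal sketch of how one might find out where the library lives, so that its directory can be added to LD_LIBRARY_PATH. The helper name find_lib_dirs and the search roots in the usage comment are assumptions for illustration, not part of Kaldi; where libcuda.so actually ends up depends on how the NVIDIA driver was installed on the machine.

```shell
# Hypothetical helper (not part of Kaldi): print every directory under the
# given search roots that contains a file whose name starts with the given
# library name. The pattern also matches versioned names such as libcuda.so.1.
find_lib_dirs() {
  fld_name=$1
  shift
  for fld_root in "$@"; do
    # Skip search roots that do not exist on this machine.
    [ -d "$fld_root" ] || continue
    find "$fld_root" -name "${fld_name}*" 2>/dev/null
  done | while IFS= read -r fld_path; do
    dirname "$fld_path"
  done | sort -u
}

# Example usage (the search roots are guesses; adjust them for your system):
#   find_lib_dirs libcuda.so /usr/lib /usr/lib64 /usr/local/cuda
#
# If a directory is printed, prepend it before launching training:
#   export LD_LIBRARY_PATH="/directory/printed/above:${LD_LIBRARY_PATH}"
```

If the helper prints a directory, exporting LD_LIBRARY_PATH with that directory in the same shell that launches steps/train_nnet.sh should let the dynamic loader find the library. If only a versioned file such as libcuda.so.1 turns up and the warning persists, a missing unversioned libcuda.so name may be the cause, though that is a guess that cannot be confirmed from the log above.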

