Kaldi compiled with CUDA but won't run nnet models


stefan....@gmail.com

Jun 10, 2022, 11:29:10 AM
to kaldi-help
Hello,

I am building a model using the TedLium recipe and trying to run the nnet models with local/chain/run_tdnn.sh, and I consistently get this error:

local/chain/run_tdnn.sh
cuda-compiled: error while loading shared libraries: libcuda.so.1: cannot open shared object file: No such file or directory
This script is intended to be used with GPUs but you have not compiled Kaldi with CUDA
If you want to use GPUs (and have them), go to src/, and configure and make on a machine
where "nvcc" is installed.

The machine I am using is a cluster machine with CUDA 11.5. Kaldi is configured with CUDA, and when I check kaldi.mk it returns:

grep -E "^CUDA\W" kaldi.mk 
CUDA = true

I also located the libcuda.so file and added it to LD_LIBRARY_PATH:

find /home/shared/apps/cuda11.5 -name 'libcuda.s*'
/home/shared/apps/cuda11.5/targets/x86_64-linux/lib/stubs/libcuda.so

LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/shared/apps/cuda11.5/targets/x86_64-linux/lib/stubs/libcuda.so

Is Kaldi looking specifically for the libcuda.so.1 symlink?

Any help would be appreciated.

Best,
Stefan

Daniel Povey

Jun 12, 2022, 10:01:20 PM
to kaldi-help
Run `ldd` on one of the Kaldi binaries that uses CUDA to see where it is trying to load the CUDA library from; the location should be in the rpath.
It could be that the machine you compiled on has the library in a different place than the machine you are running on. You will have to mess with
LD_LIBRARY_PATH in your .bashrc or .zshrc or whatever.
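
The `ldd` check suggested above can be narrowed to the CUDA libraries only. The following is a minimal sketch, not part of Kaldi: `cuda_lib_report` is a hypothetical helper name, and it assumes the usual `name => path (address)` or `name => not found` layout of `ldd` output. It reads that output on stdin, so it can also be run on output saved from another machine.

```shell
#!/usr/bin/env bash
# Sketch: summarize how the dynamic loader resolves CUDA libraries.
# cuda_lib_report is a hypothetical helper, not a Kaldi tool.
cuda_lib_report() {
  # Keep only the CUDA driver/toolkit libraries from the ldd listing.
  grep -E 'libcu(da|dart|blas|sparse|solver|rand|fft)\.so' |
  while read -r name arrow path _; do
    # ldd prints "name => not found" for libraries it cannot resolve,
    # so the third field is the literal word "not" in that case.
    if [ "$path" = "not" ]; then
      echo "MISSING  $name"
    else
      echo "OK       $name -> $path"
    fi
  done
}

# Usage, from Kaldi's src/ directory:
#   ldd nnet3bin/cuda-compiled | cuda_lib_report
```

Any line reported as MISSING names a library whose directory is neither in the binary's rpath nor on LD_LIBRARY_PATH on the machine where the command was run.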


Stefan Watson

Jul 7, 2022, 9:02:51 PM
to kaldi-help
Thanks for the reply. I tried the ldd command in the src folder:

ldd nnet3bin/cuda-compiled

This is a snippet of the results:
libdl.so.2 => /lib64/libdl.so.2 (0x00002aaab5e00000)
 libpthread.so.0 => /lib64/libpthread.so.0 (0x00002aaab6005000)
 libm.so.6 => /lib64/libm.so.6 (0x00002aaab6221000)
 libcuda.so.1 => /cm/local/apps/cuda/libs/current/lib64/libcuda.so.1 (0x00002aaab6523000)
 libcublas.so.9.0 => /home/shared/apps/cuda90/toolkit/9.0.176/lib64/libcublas.so.9.0 (0x00002aaab7c0c000) 
 libcusparse.so.9.0 => /home/shared/apps/cuda90/toolkit/9.0.176/lib64/libcusparse.so.9.0 (0x00002aaabb042000)
 libcusolver.so.9.0 => /home/shared/apps/cuda90/toolkit/9.0.176/lib64/libcusolver.so.9.0 (0x00002aaabe7a8000)
 libcudart.so.9.0 => /home/shared/apps/cuda90/toolkit/9.0.176/lib64/libcudart.so.9.0 (0x00002aaac33a4000)
 libcurand.so.9.0 => /home/shared/apps/cuda90/toolkit/9.0.176/lib64/libcurand.so.9.0 (0x00002aaac3611000)

I added that path to my .bashrc:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/cm/local/apps/cuda/libs/current/lib64/libcuda.so.1

The file is present in the specified folder. However, I still get the same error saying that the file cannot be found.

Is there anything else that I can try?

Jan Yenda Trmal

Jul 7, 2022, 9:06:42 PM
to kaldi-help
Hm, if anything, LD_LIBRARY_PATH should contain the directory, not the library file itself:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/cm/local/apps/cuda/libs/current/lib64

(you can also add this to the recipe's path.sh)
y.
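
To illustrate the correction above: the dynamic loader treats each LD_LIBRARY_PATH entry as a directory to search, so an entry that ends in a file name such as libcuda.so.1 is not used to find that file. A minimal sketch of a guard for this follows; `add_lib_dir` is a hypothetical helper name, not something Kaldi provides.

```shell
#!/usr/bin/env bash
# Sketch: LD_LIBRARY_PATH entries must be directories, not library files.
# add_lib_dir is a hypothetical helper name, not part of Kaldi.
add_lib_dir() {
  local entry=$1
  # If a library file was passed by mistake, use its containing directory.
  if [ ! -d "$entry" ]; then
    case "$entry" in
      *.so|*.so.*) entry=$(dirname "$entry") ;;
    esac
  fi
  # Do nothing if the directory is already on the search path.
  case ":${LD_LIBRARY_PATH:-}:" in
    *":$entry:"*) return 0 ;;
  esac
  # Append, adding the ':' separator only when the variable is non-empty.
  export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$entry"
}

# Usage, e.g. in the recipe's path.sh:
#   add_lib_dir /cm/local/apps/cuda/libs/current/lib64/libcuda.so.1
```

Note that a toolkit's lib/stubs directory normally holds only a link-time libcuda.so; the runtime libcuda.so.1 is normally installed with the NVIDIA driver, which would be consistent with the loader not finding libcuda.so.1 under the stubs path.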

Jan Yenda Trmal

Jul 7, 2022, 9:07:54 PM
to kaldi-help
I'm not sure that will resolve your issue, though. It's just that the way you wrote it wasn't correct.
y.

Sage Khan

Jul 27, 2022, 1:01:22 AM
to kaldi-help
I had a similar issue.
The error was as follows:

steps/online/nnet2/train_ivector_extractor.sh --cmd run.pl --nj 10 --num-processes 2 data/train_100h_sp_hires exp_gv_100h/nnet3_8000_160000/diag_ubm exp_gv_100h/nnet3_8000_160000/extractor
steps/online/nnet2/train_ivector_extractor.sh: doing Gaussian selection and posterior computation
Accumulating stats (pass 0)
Summing accs (pass 0)
Updating model (pass 0)
Accumulating stats (pass 1)
Summing accs (pass 1)
Updating model (pass 1)
Accumulating stats (pass 2)
Summing accs (pass 2)
Updating model (pass 2)
Accumulating stats (pass 3)
Summing accs (pass 3)
Updating model (pass 3)
Accumulating stats (pass 4)
Summing accs (pass 4)
Updating model (pass 4)
Accumulating stats (pass 5)
Summing accs (pass 5)
Updating model (pass 5)
Accumulating stats (pass 6)
Summing accs (pass 6)
Updating model (pass 6)
Accumulating stats (pass 7)
Summing accs (pass 7)
Updating model (pass 7)
Accumulating stats (pass 8)
Summing accs (pass 8)
Updating model (pass 8)
Accumulating stats (pass 9)
Summing accs (pass 9)
Updating model (pass 9)
local/chain/Run_ivector.sh: extracting iVectors for training data
utils/data/modify_speaker_info.sh: copied data from data/train_100h_sp_hires to exp_gv_100h/nnet3_8000_160000/ivectors_train_100h_sp_hires/train_100h_sp_hires_max2, number of speakers changed from 117 to 10095
utils/validate_data_dir.sh: Successfully validated data-directory exp_gv_100h/nnet3_8000_160000/ivectors_train_100h_sp_hires/train_100h_sp_hires_max2
steps/online/nnet2/extract_ivectors_online.sh --cmd run.pl --nj 60 exp_gv_100h/nnet3_8000_160000/ivectors_train_100h_sp_hires/train_100h_sp_hires_max2 exp_gv_100h/nnet3_8000_160000/extractor exp_gv_100h/nnet3_8000_160000/ivectors_train_100h_sp_hires
steps/online/nnet2/extract_ivectors_online.sh: extracting iVectors
run.pl: 60 / 60 failed, log is in exp_gv_100h/nnet3_8000_160000/ivectors_train_100h_sp_hires/log/extract_ivectors.*.log

$ cat exp_gv_100h/nnet3_8000_160000/ivectors_train_100h_sp_hires/log/extract_ivectors.*.log >> ivector-error.txt 

The issue turned out to be that Kaldi could not find CUDA, probably because I updated CUDA after compiling Kaldi. So steps/online/nnet2/extract_ivectors_online.sh was not running. The log showed:
ivector-extract-online2: error while loading shared libraries: libcudart.so.10.1: cannot open shared object file: No such file or directory
It cannot find your CUDA library. Try to run it from your command line:
. ./path.sh
ivector-extract-online2

I went back to KALDI_ROOT/src and rebuilt Kaldi. You can run a plain ./configure, or ./configure --shared --use-cuda --cudatk-dir=/usr/local/cuda ... Then run make clean, make depend and make.
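
The rebuild steps above can be collected into one function. This is only a sketch under assumptions: `rebuild_kaldi` and the `DRY_RUN` variable are hypothetical names introduced here, the default toolkit path is the one quoted above, and any further configure options elided in the post are not filled in.

```shell
#!/usr/bin/env bash
# Sketch: run (or just print) the rebuild steps described above.
# rebuild_kaldi and DRY_RUN are hypothetical names, not part of Kaldi.
rebuild_kaldi() {
  local cudatk_dir=${1:-/usr/local/cuda}
  local run=""
  # With DRY_RUN=1 every command is echoed instead of executed.
  [ "${DRY_RUN:-0}" = "1" ] && run="echo"
  # Stop at the first failing step so a broken configure is not built.
  $run ./configure --shared --use-cuda --cudatk-dir="$cudatk_dir" &&
  $run make clean &&
  $run make depend &&
  $run make
}

# Usage:
#   cd "$KALDI_ROOT/src" && DRY_RUN=1 rebuild_kaldi /usr/local/cuda
#   cd "$KALDI_ROOT/src" && rebuild_kaldi /usr/local/cuda
```

Running it with DRY_RUN=1 first makes it easy to confirm the toolkit directory before starting a long build.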

Also ensure the GPU is set to exclusive compute mode instead of default mode, using nvidia-smi.
To check:
nvidia-smi  --query | grep 'Compute Mode'

To change:
sudo nvidia-smi -c 3

This fixed the issue

Regards
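
The compute-mode check in the post above can be turned into a yes/no test. The sketch below assumes that `nvidia-smi --query` prints one `Compute Mode : <value>` line per GPU and that exclusive modes contain the word "Exclusive" (for example Exclusive_Process, which is what `-c 3` selects); `compute_mode_is_exclusive` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: succeed only if every GPU reports an exclusive compute mode.
# compute_mode_is_exclusive is a hypothetical helper; it reads the
# output of `nvidia-smi --query` on stdin.
compute_mode_is_exclusive() {
  awk -F: '
    /Compute Mode/ {
      n++                                   # count GPUs seen
      if (tolower($2) !~ /exclusive/) bad++ # count non-exclusive GPUs
    }
    END { exit (n > 0 && bad == 0) ? 0 : 1 }
  '
}

# Usage:
#   if nvidia-smi --query | compute_mode_is_exclusive; then
#     echo "all GPUs are in exclusive mode"
#   else
#     echo "consider: sudo nvidia-smi -c 3"
#   fi
```

The function also fails when no Compute Mode line is present at all, so a missing driver is not mistaken for a correct setting.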
