[kaldi-help] Running mini_librispeech

Lee

Nov 18, 2021, 5:09:39 AM
to kaldi-help
Hi all,

I'm trying to run the mini_librispeech recipe, but it fails at the last step, while training the model with the script local/chain2/run_tdnn.sh.

"cmd.sh" has been setting as follows:

export train_cmd=run.pl
export decode_cmd=run.pl
export mkgraph_cmd=run.pl
export cuda_cmd="run.pl --gpu 1"

I'm getting the following error:
run.pl: job failed, log is in exp/chain2/tdnn1a_sp/log/train.0.1.log
run.pl: job failed, log is in exp/chain2/tdnn1a_sp/log/train.0.2.log

This is the error log 'train.0.1.log':

nnet3-chain-train2 --out-of-range-regularize=0.01 --write-cache=exp/chain2/tdnn1a_sp/cache.1 --use-gpu=yes --apply-deriv-weights=false --leaky-hmm-coefficient=0.1 --xent-regularize=0.1 --print-interval=10 --max-param-change=2.0 --momentum=0.0 --l2-regularize-factor=0.5 --srand=0 'nnet3-copy --learning-rate=0.002 exp/chain2/tdnn1a_sp/0.raw - |' exp/chain2/tdnn1a_sp/egs/misc 'ark:nnet3-chain-copy-egs  --frame-shift=1 scp:exp/chain2/tdnn1a_sp/egs/train.1.scp ark:- | nnet3-chain-shuffle-egs --buffer-size=1000 --srand=0 ark:- ark:- | nnet3-chain-merge-egs  --minibatch-size=256,128,64 ark:- ark:-|' exp/chain2/tdnn1a_sp/1.1.raw 
WARNING (nnet3-chain-train2[5.5.989~1-66f5]:SelectGpuId():cu-device.cc:243) Not in compute-exclusive mode.  Suggestion: use 'nvidia-smi -c 3' to set compute exclusive mode
LOG (nnet3-chain-train2[5.5.989~1-66f5]:SelectGpuIdAuto():cu-device.cc:438) Selecting from 1 GPUs
LOG (nnet3-chain-train2[5.5.989~1-66f5]:SelectGpuIdAuto():cu-device.cc:453) cudaSetDevice(0): NVIDIA GeForce GTX 760 free:876M, used:1121M, total:1998M, free/total:0.438757
LOG (nnet3-chain-train2[5.5.989~1-66f5]:SelectGpuIdAuto():cu-device.cc:501) Device: 0, mem_ratio: 0.438757
LOG (nnet3-chain-train2[5.5.989~1-66f5]:SelectGpuId():cu-device.cc:382) Trying to select device: 0
LOG (nnet3-chain-train2[5.5.989~1-66f5]:SelectGpuIdAuto():cu-device.cc:511) Success selecting device 0 free mem ratio: 0.438757
ERROR (nnet3-chain-train2[5.5.989~1-66f5]:FinalizeActiveGpu():cu-device.cc:289) cublasStatus_t 3 : "CUBLAS_STATUS_ALLOC_FAILED" returned from 'cublasCreate(&cublas_handle_)'

[ Stack-Trace: ]
nnet3-chain-train2(kaldi::MessageLogger::LogMessage() const+0xb42) [0x563f7472c1b8]
nnet3-chain-train2(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x21) [0x563f743aa76f]
nnet3-chain-train2(kaldi::CuDevice::FinalizeActiveGpu()+0x46a) [0x563f745c5cee]
nnet3-chain-train2(kaldi::CuDevice::SelectGpuId(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)+0xdfd) [0x563f745c73b5]
nnet3-chain-train2(main+0x483) [0x563f743a9b8d]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7f90b6ffcbf7]
nnet3-chain-train2(_start+0x2a) [0x563f743a962a]


This is the error log 'train.0.2.log':

nnet3-chain-train2 --out-of-range-regularize=0.01 --write-cache=exp/chain2/tdnn1a_sp/cache.1 --use-gpu=yes --apply-deriv-weights=false --leaky-hmm-coefficient=0.1 --xent-regularize=0.1 --print-interval=10 --max-param-change=2.0 --momentum=0.0 --l2-regularize-factor=0.5 --srand=0 'nnet3-copy --learning-rate=0.002 exp/chain2/tdnn1a_sp/0.raw - |' exp/chain2/tdnn1a_sp/egs/misc 'ark:nnet3-chain-copy-egs  --frame-shift=2 scp:exp/chain2/tdnn1a_sp/egs/train.2.scp ark:- | nnet3-chain-shuffle-egs --buffer-size=1000 --srand=0 ark:- ark:- | nnet3-chain-merge-egs  --minibatch-size=256,128,64 ark:- ark:-|' exp/chain2/tdnn1a_sp/1.2.raw 
WARNING (nnet3-chain-train2[5.5.989~1-66f5]:SelectGpuId():cu-device.cc:243) Not in compute-exclusive mode.  Suggestion: use 'nvidia-smi -c 3' to set compute exclusive mode
LOG (nnet3-chain-train2[5.5.989~1-66f5]:SelectGpuIdAuto():cu-device.cc:438) Selecting from 1 GPUs
LOG (nnet3-chain-train2[5.5.989~1-66f5]:SelectGpuIdAuto():cu-device.cc:453) cudaSetDevice(0): NVIDIA GeForce GTX 760 free:878M, used:1119M, total:1998M, free/total:0.439757
LOG (nnet3-chain-train2[5.5.989~1-66f5]:SelectGpuIdAuto():cu-device.cc:501) Device: 0, mem_ratio: 0.439757
LOG (nnet3-chain-train2[5.5.989~1-66f5]:SelectGpuId():cu-device.cc:382) Trying to select device: 0
LOG (nnet3-chain-train2[5.5.989~1-66f5]:SelectGpuIdAuto():cu-device.cc:511) Success selecting device 0 free mem ratio: 0.439757
ERROR (nnet3-chain-train2[5.5.989~1-66f5]:FinalizeActiveGpu():cu-device.cc:289) cublasStatus_t 3 : "CUBLAS_STATUS_ALLOC_FAILED" returned from 'cublasCreate(&cublas_handle_)'

[ Stack-Trace: ]
nnet3-chain-train2(kaldi::MessageLogger::LogMessage() const+0xb42) [0x557c8ba5e1b8]
nnet3-chain-train2(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x21) [0x557c8b6dc76f]
nnet3-chain-train2(kaldi::CuDevice::FinalizeActiveGpu()+0x46a) [0x557c8b8f7cee]
nnet3-chain-train2(kaldi::CuDevice::SelectGpuId(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)+0xdfd) [0x557c8b8f93b5]
nnet3-chain-train2(main+0x483) [0x557c8b6dbb8d]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7f7353f37bf7]
nnet3-chain-train2(_start+0x2a) [0x557c8b6db62a]


Any help is appreciated.

Jan-Willem van Leussen

Nov 18, 2021, 8:12:35 AM
to kaldi...@googlegroups.com
Hi Lee,

The log you attached gives a suggestion:

WARNING (nnet3-chain-train2[5.5.989~1-66f5]:SelectGpuId():cu-device.cc:243) Not in compute-exclusive mode.  Suggestion: use 'nvidia-smi -c 3' to set compute exclusive mode

If you set compute-exclusive mode with the command nvidia-smi -c 3, the chain training should hopefully continue successfully.
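For reference, something along these lines should do it (a rough sketch; changing the compute mode requires root, and the query output may differ slightly between driver versions):

nvidia-smi --query-gpu=compute_mode --format=csv   # check the current compute mode
sudo nvidia-smi -c 3                               # 3 = EXCLUSIVE_PROCESS
nvidia-smi --query-gpu=compute_mode --format=csv   # should now report Exclusive_Process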

Jan



Jan-Willem van Leussen

Nov 19, 2021, 7:50:32 AM
to Lee, kaldi...@googlegroups.com
Hi Lee,

It's probably better to ask your follow-up question on kaldi-help again. My Kaldi knowledge is quite limited compared to that of the authors who frequent the forum :)

My best guess: your error may be caused by a CUDA version mismatch with Kaldi. You can test whether your compilation with CUDA is working by running make test in the src/ folder of your Kaldi installation.
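
Roughly something like this (paths assume a standard Kaldi checkout; a quicker check would be to run only the cudamatrix tests, which exercise the GPU code):

cd kaldi/src
make test                      # builds and runs the test binaries for the compiled directories
# or, for a quicker check of just the CUDA matrix code:
cd kaldi/src/cudamatrix
make test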

Jan-Willem

On Fri, Nov 19, 2021 at 11:40 AM Lee <liu...@gmail.com> wrote:
Thanks for your response!

I've set compute-exclusive mode with the command nvidia-smi -c 3, but I'm still getting errors:

run.pl: job failed, log is in exp/chain2/tdnn1a_sp/log/train.0.2.log
run.pl: job failed, log is in exp/chain2/tdnn1a_sp/log/train.0.1.log
steps/chain2/train.sh: error detected training on iteration 0

train.0.1.log:
nnet3-chain-train2 --out-of-range-regularize=0.01 --write-cache=exp/chain2/tdnn1a_sp/cache.1 --use-gpu=yes --apply-deriv-weights=false --leaky-hmm-coefficient=0.1 --xent-regularize=0.1 --print-interval=10 --max-param-change=2.0 --momentum=0.0 --l2-regularize-factor=0.5 --srand=0 'nnet3-copy --learning-rate=0.002 exp/chain2/tdnn1a_sp/0.raw - |' exp/chain2/tdnn1a_sp/egs/misc 'ark:nnet3-chain-copy-egs  --frame-shift=1 scp:exp/chain2/tdnn1a_sp/egs/train.1.scp ark:- | nnet3-chain-shuffle-egs --buffer-size=1000 --srand=0 ark:- ark:- | nnet3-chain-merge-egs  --minibatch-size=256,128,64 ark:- ark:-|' exp/chain2/tdnn1a_sp/1.1.raw 
WARNING (nnet3-chain-train2[5.5.989~1-66f5]:SelectGpuId():cu-device.cc:197) Will try again to get a GPU after 20 seconds.
LOG (nnet3-chain-train2[5.5.989~1-66f5]:SelectGpuId():cu-device.cc:238) CUDA setup operating under Compute Exclusive Mode.
ERROR (nnet3-chain-train2[5.5.989~1-66f5]:FinalizeActiveGpu():cu-device.cc:289) cublasStatus_t 3 : "CUBLAS_STATUS_ALLOC_FAILED" returned from 'cublasCreate(&cublas_handle_)'

[ Stack-Trace: ]
nnet3-chain-train2(kaldi::MessageLogger::LogMessage() const+0xb42) [0x561bc358a1b8]
nnet3-chain-train2(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x21) [0x561bc320876f]
nnet3-chain-train2(kaldi::CuDevice::FinalizeActiveGpu()+0x46a) [0x561bc3423cee]
nnet3-chain-train2(kaldi::CuDevice::SelectGpuId(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)+0xc17) [0x561bc34251cf]
nnet3-chain-train2(main+0x483) [0x561bc3207b8d]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7f458d4fcbf7]
nnet3-chain-train2(_start+0x2a) [0x561bc320762a]


train.0.2.log:
nnet3-chain-train2 --out-of-range-regularize=0.01 --write-cache=exp/chain2/tdnn1a_sp/cache.1 --use-gpu=yes --apply-deriv-weights=false --leaky-hmm-coefficient=0.1 --xent-regularize=0.1 --print-interval=10 --max-param-change=2.0 --momentum=0.0 --l2-regularize-factor=0.5 --srand=0 'nnet3-copy --learning-rate=0.002 exp/chain2/tdnn1a_sp/0.raw - |' exp/chain2/tdnn1a_sp/egs/misc 'ark:nnet3-chain-copy-egs  --frame-shift=2 scp:exp/chain2/tdnn1a_sp/egs/train.2.scp ark:- | nnet3-chain-shuffle-egs --buffer-size=1000 --srand=0 ark:- ark:- | nnet3-chain-merge-egs  --minibatch-size=256,128,64 ark:- ark:-|' exp/chain2/tdnn1a_sp/1.2.raw 
LOG (nnet3-chain-train2[5.5.989~1-66f5]:SelectGpuId():cu-device.cc:238) CUDA setup operating under Compute Exclusive Mode.
ERROR (nnet3-chain-train2[5.5.989~1-66f5]:FinalizeActiveGpu():cu-device.cc:289) cublasStatus_t 3 : "CUBLAS_STATUS_ALLOC_FAILED" returned from 'cublasCreate(&cublas_handle_)'

[ Stack-Trace: ]
nnet3-chain-train2(kaldi::MessageLogger::LogMessage() const+0xb42) [0x55b00a4991b8]
nnet3-chain-train2(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x21) [0x55b00a11776f]
nnet3-chain-train2(kaldi::CuDevice::FinalizeActiveGpu()+0x46a) [0x55b00a332cee]
nnet3-chain-train2(kaldi::CuDevice::SelectGpuId(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)+0xc17) [0x55b00a3341cf]
nnet3-chain-train2(main+0x483) [0x55b00a116b8d]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7f90fb817bf7]
nnet3-chain-train2(_start+0x2a) [0x55b00a11662a]
