Hi all,
I'm trying to run the mini_librispeech recipe, but it fails at the last step, while training the model with the script local/chain2/run_tdnn.sh.
In "cmd.sh", cuda_cmd is set as follows:
export cuda_cmd="run.pl --gpu 1"
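For context, the rest of my cmd.sh just uses run.pl for a local single-machine setup (I only changed cuda_cmd; the other lines below are the usual local defaults, shown here so you can see my assumptions):

```shell
# cmd.sh -- local single-machine setup, all jobs run via run.pl
export train_cmd="run.pl --mem 2G"    # CPU training stages
export decode_cmd="run.pl --mem 4G"   # decoding stages
export cuda_cmd="run.pl --gpu 1"      # GPU training stages (the line I set)
```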
I'm getting the following error:
run.pl: job failed, log is in exp/chain2/tdnn1a_sp/log/train.0.1.log
run.pl: job failed, log is in exp/chain2/tdnn1a_sp/log/train.0.2.log
This is the error log 'train.0.1.log':
nnet3-chain-train2 --out-of-range-regularize=0.01 --write-cache=exp/chain2/tdnn1a_sp/cache.1 --use-gpu=yes --apply-deriv-weights=false --leaky-hmm-coefficient=0.1 --xent-regularize=0.1 --print-interval=10 --max-param-change=2.0 --momentum=0.0 --l2-regularize-factor=0.5 --srand=0 'nnet3-copy --learning-rate=0.002 exp/chain2/tdnn1a_sp/0.raw - |' exp/chain2/tdnn1a_sp/egs/misc 'ark:nnet3-chain-copy-egs --frame-shift=1 scp:exp/chain2/tdnn1a_sp/egs/train.1.scp ark:- | nnet3-chain-shuffle-egs --buffer-size=1000 --srand=0 ark:- ark:- | nnet3-chain-merge-egs --minibatch-size=256,128,64 ark:- ark:-|' exp/chain2/tdnn1a_sp/1.1.raw
WARNING (nnet3-chain-train2[5.5.989~1-66f5]:SelectGpuId():cu-device.cc:243) Not in compute-exclusive mode. Suggestion: use 'nvidia-smi -c 3' to set compute exclusive mode
LOG (nnet3-chain-train2[5.5.989~1-66f5]:SelectGpuIdAuto():cu-device.cc:438) Selecting from 1 GPUs
LOG (nnet3-chain-train2[5.5.989~1-66f5]:SelectGpuIdAuto():cu-device.cc:453) cudaSetDevice(0): NVIDIA GeForce GTX 760 free:876M, used:1121M, total:1998M, free/total:0.438757
LOG (nnet3-chain-train2[5.5.989~1-66f5]:SelectGpuIdAuto():cu-device.cc:501) Device: 0, mem_ratio: 0.438757
LOG (nnet3-chain-train2[5.5.989~1-66f5]:SelectGpuId():cu-device.cc:382) Trying to select device: 0
LOG (nnet3-chain-train2[5.5.989~1-66f5]:SelectGpuIdAuto():cu-device.cc:511) Success selecting device 0 free mem ratio: 0.438757
ERROR (nnet3-chain-train2[5.5.989~1-66f5]:FinalizeActiveGpu():cu-device.cc:289) cublasStatus_t 3 : "CUBLAS_STATUS_ALLOC_FAILED" returned from 'cublasCreate(&cublas_handle_)'
[ Stack-Trace: ]
nnet3-chain-train2(kaldi::MessageLogger::LogMessage() const+0xb42) [0x563f7472c1b8]
nnet3-chain-train2(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x21) [0x563f743aa76f]
nnet3-chain-train2(kaldi::CuDevice::FinalizeActiveGpu()+0x46a) [0x563f745c5cee]
nnet3-chain-train2(kaldi::CuDevice::SelectGpuId(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)+0xdfd) [0x563f745c73b5]
nnet3-chain-train2(main+0x483) [0x563f743a9b8d]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7f90b6ffcbf7]
nnet3-chain-train2(_start+0x2a) [0x563f743a962a]
This is the error log 'train.0.2.log':
nnet3-chain-train2 --out-of-range-regularize=0.01 --write-cache=exp/chain2/tdnn1a_sp/cache.1 --use-gpu=yes --apply-deriv-weights=false --leaky-hmm-coefficient=0.1 --xent-regularize=0.1 --print-interval=10 --max-param-change=2.0 --momentum=0.0 --l2-regularize-factor=0.5 --srand=0 'nnet3-copy --learning-rate=0.002 exp/chain2/tdnn1a_sp/0.raw - |' exp/chain2/tdnn1a_sp/egs/misc 'ark:nnet3-chain-copy-egs --frame-shift=2 scp:exp/chain2/tdnn1a_sp/egs/train.2.scp ark:- | nnet3-chain-shuffle-egs --buffer-size=1000 --srand=0 ark:- ark:- | nnet3-chain-merge-egs --minibatch-size=256,128,64 ark:- ark:-|' exp/chain2/tdnn1a_sp/1.2.raw
WARNING (nnet3-chain-train2[5.5.989~1-66f5]:SelectGpuId():cu-device.cc:243) Not in compute-exclusive mode. Suggestion: use 'nvidia-smi -c 3' to set compute exclusive mode
LOG (nnet3-chain-train2[5.5.989~1-66f5]:SelectGpuIdAuto():cu-device.cc:438) Selecting from 1 GPUs
LOG (nnet3-chain-train2[5.5.989~1-66f5]:SelectGpuIdAuto():cu-device.cc:453) cudaSetDevice(0): NVIDIA GeForce GTX 760 free:878M, used:1119M, total:1998M, free/total:0.439757
LOG (nnet3-chain-train2[5.5.989~1-66f5]:SelectGpuIdAuto():cu-device.cc:501) Device: 0, mem_ratio: 0.439757
LOG (nnet3-chain-train2[5.5.989~1-66f5]:SelectGpuId():cu-device.cc:382) Trying to select device: 0
LOG (nnet3-chain-train2[5.5.989~1-66f5]:SelectGpuIdAuto():cu-device.cc:511) Success selecting device 0 free mem ratio: 0.439757
ERROR (nnet3-chain-train2[5.5.989~1-66f5]:FinalizeActiveGpu():cu-device.cc:289) cublasStatus_t 3 : "CUBLAS_STATUS_ALLOC_FAILED" returned from 'cublasCreate(&cublas_handle_)'
[ Stack-Trace: ]
nnet3-chain-train2(kaldi::MessageLogger::LogMessage() const+0xb42) [0x557c8ba5e1b8]
nnet3-chain-train2(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x21) [0x557c8b6dc76f]
nnet3-chain-train2(kaldi::CuDevice::FinalizeActiveGpu()+0x46a) [0x557c8b8f7cee]
nnet3-chain-train2(kaldi::CuDevice::SelectGpuId(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)+0xdfd) [0x557c8b8f93b5]
nnet3-chain-train2(main+0x483) [0x557c8b6dbb8d]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7f7353f37bf7]
nnet3-chain-train2(_start+0x2a) [0x557c8b6db62a]
Any help is appreciated. Thanks!