Dear Kaldi users,
I was able to reproduce the LibriSpeech s5 recipe with a single NVIDIA Tesla K40c CUDA board: I set the use_gpu=true variable and everything ran perfectly smoothly.
I have now installed a second NVIDIA Tesla K40c CUDA board in the same server and tried to re-run the same LibriSpeech s5 recipe, going directly to the run_7a_960.sh script, but I hit a problem in train_pnorm_fast.sh. The two log files are attached below.
As before I set the use_gpu=true variable, but this time I set num_jobs_nnet=2 instead of the num_jobs_nnet=1 I used in the single-board case.
It seems there is a problem getting the second board running. Any hints?
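For reference, the first thing I plan to check is the compute mode of both boards; a minimal sketch, assuming the two K40c boards are devices 0 and 1 on this machine:

# show the compute mode reported for each board
nvidia-smi -q -d COMPUTE

# Kaldi's automatic GPU selection expects exclusive mode; if either board
# reports "Default", switch it (needs root):
sudo nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
sudo nvidia-smi -i 1 -c EXCLUSIVE_PROCESS

train.0.1.log below confirms board 0 is already in "Compute Exclusive Process Mode", but I have not yet verified board 1.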
Many thanks in advance!
Piero
train.0.1.log (this one looks fine; see train.0.2.log below for the failure)
# nnet-shuffle-egs --buffer-size=5000 --srand=0 ark:exp/nnet7a_960_gpu/egs/egs.1.0.ark ark:- | nnet-train-simple --minibatch-size=256 --srand=0 exp/nnet7a_960_gpu/0.mdl ark:- exp/nnet7a_960_gpu/1.1.mdl
# Started at Wed Dec 2 15:52:10 CET 2015
#
nnet-train-simple --minibatch-size=256 --srand=0 exp/nnet7a_960_gpu/0.mdl ark:- exp/nnet7a_960_gpu/1.1.mdl
nnet-shuffle-egs --buffer-size=5000 --srand=0 ark:exp/nnet7a_960_gpu/egs/egs.1.0.ark ark:-
LOG (nnet-train-simple:IsComputeExclusive():cu-device.cc:251) CUDA setup operating under Compute Exclusive Process Mode.
LOG (nnet-train-simple:FinalizeActiveGpu():cu-device.cc:213) The active GPU is [0]: Tesla K40c free:11406M, used:112M, total:11519M, free/total:0.990203 version 3.5
LOG (nnet-train-simple:GetScalingFactor():nnet-component.cc:1914) Limiting step size using scaling factor 0.480316, for component index 5
LOG (nnet-train-simple:GetScalingFactor():nnet-component.cc:1914) Limiting step size using scaling factor 0.433966, for component index 5
LOG (nnet-train-simple:GetScalingFactor():nnet-component.cc:1914) Limiting step size using scaling factor 0.444137, for component index 5
LOG (nnet-train-simple:GetScalingFactor():nnet-component.cc:1914) Limiting step size using scaling factor 0.437308, for component index 5
LOG (nnet-train-simple:GetScalingFactor():nnet-component.cc:1914) Limiting step size using scaling factor 0.449816, for component index 5
LOG (nnet-train-simple:GetScalingFactor():nnet-component.cc:1914) Limiting step size using scaling factor 0.444301, for component index 5
LOG (nnet-train-simple:GetScalingFactor():nnet-component.cc:1914) Limiting step size using scaling factor 0.449457, for component index 5
LOG (nnet-train-simple:GetScalingFactor():nnet-component.cc:1914) Limiting step size using scaling factor 0.455287, for component index 5
LOG (nnet-train-simple:GetScalingFactor():nnet-component.cc:1914) Limiting step size using scaling factor 0.452486, for component index 5
LOG (nnet-train-simple:GetScalingFactor():nnet-component.cc:1914) Limiting step size using scaling factor 0.468474, for component index 5
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -7.84907 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -6.91705 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -6.30462 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -5.82287 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -5.44101 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -5.1134 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -4.83451 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -4.66147 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -4.43641 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -4.26758 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -4.13669 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -4.03057 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -3.93021 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -3.82303 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -3.76231 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -3.68759 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -3.62382 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -3.56905 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -3.53205 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -3.46379 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -3.42514 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -3.41068 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -3.36186 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -3.34475 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -3.29167 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -3.29596 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -3.25996 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -3.24725 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -3.21627 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -3.19568 over 12800 frames.
LOG (nnet-shuffle-egs:main():nnet-shuffle-egs.cc:102) Shuffled order of 393060 neural-network training examples using a buffer (partial randomization)
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -3.15921 over 9060 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:202) Did backprop on 393060 examples, average log-prob per frame is -4.18436
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:205) [this line is to be parsed by a script:] log-prob-per-frame=-4.18436
LOG (nnet-train-simple:PrintProfile():cu-device.cc:415) -----
[cudevice profile]
CuVectorBase::ApplyFloor 0.235996s
GroupPnorm 0.243694s
AddMatVec 0.252531s
CuVector::Resize 0.258291s
Sum 0.2678s
MulColsVec 0.270175s
AddDiagVecMat 0.316658s
GroupPnormDeriv 0.36152s
ApplySoftMaxPerRow 0.363778s
SymAddMat2 0.734227s
AddDiagMatMat 0.738146s
CuMatrixBase::CopyFromMat(from other CuMatrixBase) 0.808024s
CuMatrix::Resize 1.1938s
CuMatrix::SetZero 1.26555s
AddMatMat 12.1294s
Total GPU time: 21.1021s (may involve some double-counting)
-----
LOG (nnet-train-simple:PrintMemoryUsage():cu-allocator.cc:127) Memory usage: 79775184 bytes currently allocated (max: 81160528); 0 currently in use by user (max: 54112000); 80/106065 calls to Malloc* resulted in CUDA calls.
LOG (nnet-train-simple:PrintMemoryUsage():cu-allocator.cc:134) Time taken in cudaMallocPitch=0.00515485, in cudaMalloc=0.000409603, in cudaFree=0.000518322, in this->MallocPitch()=0.100768
LOG (nnet-train-simple:PrintMemoryUsage():cu-device.cc:388) Memory used (according to the device): 87040000 bytes.
LOG (nnet-train-simple:main():nnet-train-simple.cc:107) Finished training, processed 393060 training examples. Wrote model to exp/nnet7a_960_gpu/1.1.mdl
# Accounting: time=23 threads=1
# Ended (code 0) at Wed Dec 2 15:52:33 CET 2015, elapsed time 23 seconds
train.0.2.log (this is the job that fails)
# nnet-shuffle-egs --buffer-size=5000 --srand=0 ark:exp/nnet7a_960_gpu/egs/egs.2.0.ark ark:- | nnet-train-simple --minibatch-size=256 --srand=0 exp/nnet7a_960_gpu/0.mdl ark:- exp/nnet7a_960_gpu/1.2.mdl
# Started at Wed Dec 2 15:52:10 CET 2015
#
nnet-train-simple --minibatch-size=256 --srand=0 exp/nnet7a_960_gpu/0.mdl ark:- exp/nnet7a_960_gpu/1.2.mdl
nnet-shuffle-egs --buffer-size=5000 --srand=0 ark:exp/nnet7a_960_gpu/egs/egs.2.0.ark ark:-
WARNING (nnet-train-simple:SelectGpuId():cu-device.cc:129) Will try again to get a GPU after 20 seconds.
KALDI_ASSERT: at nnet-train-simple:SelectGpuId:cu-device.cc:161, failed: cudaSuccess == cudaThreadSynchronize()
Stack trace is:
kaldi::KaldiGetStackTrace()
kaldi::KaldiAssertFailure_(char const*, char const*, int, char const*)
kaldi::CuDevice::SelectGpuId(std::string)
nnet-train-simple(main+0x417) [0x692a0d]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f31c19b1b45]
nnet-train-simple() [0x692529]
KALDI_ASSERT: at nnet-train-simple:SelectGpuId:cu-device.cc:161, failed: cudaSuccess == cudaThreadSynchronize()
Stack trace is:
kaldi::KaldiGetStackTrace()
kaldi::KaldiAssertFailure_(char const*, char const*, int, char const*)
kaldi::CuDevice::SelectGpuId(std::string)
nnet-train-simple(main+0x417) [0x692a0d]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f31c19b1b45]
nnet-train-simple() [0x692529]
# Accounting: time=20 threads=1
# Ended (code 255) at Wed Dec 2 15:52:30 CET 2015, elapsed time 20 seconds
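One thing I have not tried yet is pinning each training job to a fixed board with CUDA_VISIBLE_DEVICES, so that the second process cannot land on the board that is already busy. A rough sketch, reusing the exact pipelines from the two logs above (the 0/1 device indices are my assumption):

# job 1 pinned to the first board
nnet-shuffle-egs --buffer-size=5000 --srand=0 ark:exp/nnet7a_960_gpu/egs/egs.1.0.ark ark:- | \
  CUDA_VISIBLE_DEVICES=0 nnet-train-simple --minibatch-size=256 --srand=0 \
    exp/nnet7a_960_gpu/0.mdl ark:- exp/nnet7a_960_gpu/1.1.mdl &

# job 2 pinned to the second board
nnet-shuffle-egs --buffer-size=5000 --srand=0 ark:exp/nnet7a_960_gpu/egs/egs.2.0.ark ark:- | \
  CUDA_VISIBLE_DEVICES=1 nnet-train-simple --minibatch-size=256 --srand=0 \
    exp/nnet7a_960_gpu/0.mdl ark:- exp/nnet7a_960_gpu/1.2.mdl &

wait

If I understand SelectGpuId correctly, each process would then see exactly one board (as device 0), so the second job could no longer collide with the first.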