Problem with "train_pnorm_fast.sh" when running run_7a_960.sh on LIBRISPEECH with 2 NVIDIA Tesla K40c CUDA boards


Piero Cosi

Dec 2, 2015, 10:59:13 AM
to kaldi-help

Dear KALDI users,

I was able to reproduce the LIBRISPEECH s5 recipe with a single NVIDIA Tesla K40c CUDA board.
I set the "use_gpu=true" variable
... and everything was PERFECTLY SMOOTH!

I have now installed my second NVIDIA Tesla K40c CUDA board on the same server and
tried to re-run the same LIBRISPEECH s5 recipe.

I went directly to the run_7a_960.sh script,
but I got a problem while using "train_pnorm_fast.sh".
I have attached the 2 log files!

I again set the "use_gpu=true" variable,
but in this case I set the "num_jobs_nnet=2" variable
instead of "num_jobs_nnet=1" as in the first single-board case!


It seems there is a problem getting the 2nd board running!?!

ANY HINTS?!?!?!

MANY THANKS in advance!
Piero


train.0.1.log   (this seems OK! ... see below for train.0.2.log)
# nnet-shuffle-egs --buffer-size=5000 --srand=0 ark:exp/nnet7a_960_gpu/egs/egs.1.0.ark ark:- | nnet-train-simple --minibatch-size=256 --srand=0 exp/nnet7a_960_gpu/0.mdl ark:- exp/nnet7a_960_gpu/1.1.mdl
# Started at Wed Dec  2 15:52:10 CET 2015
#
nnet-train-simple --minibatch-size=256 --srand=0 exp/nnet7a_960_gpu/0.mdl ark:- exp/nnet7a_960_gpu/1.1.mdl
nnet-shuffle-egs --buffer-size=5000 --srand=0 ark:exp/nnet7a_960_gpu/egs/egs.1.0.ark ark:-
LOG (nnet-train-simple:IsComputeExclusive():cu-device.cc:251) CUDA setup operating under Compute Exclusive Process Mode.
LOG (nnet-train-simple:FinalizeActiveGpu():cu-device.cc:213) The active GPU is [0]: Tesla K40c    free:11406M, used:112M, total:11519M, free/total:0.990203 version 3.5
LOG (nnet-train-simple:GetScalingFactor():nnet-component.cc:1914) Limiting step size using scaling factor 0.480316, for component index 5
LOG (nnet-train-simple:GetScalingFactor():nnet-component.cc:1914) Limiting step size using scaling factor 0.433966, for component index 5
LOG (nnet-train-simple:GetScalingFactor():nnet-component.cc:1914) Limiting step size using scaling factor 0.444137, for component index 5
LOG (nnet-train-simple:GetScalingFactor():nnet-component.cc:1914) Limiting step size using scaling factor 0.437308, for component index 5
LOG (nnet-train-simple:GetScalingFactor():nnet-component.cc:1914) Limiting step size using scaling factor 0.449816, for component index 5
LOG (nnet-train-simple:GetScalingFactor():nnet-component.cc:1914) Limiting step size using scaling factor 0.444301, for component index 5
LOG (nnet-train-simple:GetScalingFactor():nnet-component.cc:1914) Limiting step size using scaling factor 0.449457, for component index 5
LOG (nnet-train-simple:GetScalingFactor():nnet-component.cc:1914) Limiting step size using scaling factor 0.455287, for component index 5
LOG (nnet-train-simple:GetScalingFactor():nnet-component.cc:1914) Limiting step size using scaling factor 0.452486, for component index 5
LOG (nnet-train-simple:GetScalingFactor():nnet-component.cc:1914) Limiting step size using scaling factor 0.468474, for component index 5
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -7.84907 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -6.91705 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -6.30462 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -5.82287 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -5.44101 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -5.1134 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -4.83451 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -4.66147 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -4.43641 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -4.26758 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -4.13669 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -4.03057 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -3.93021 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -3.82303 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -3.76231 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -3.68759 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -3.62382 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -3.56905 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -3.53205 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -3.46379 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -3.42514 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -3.41068 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -3.36186 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -3.34475 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -3.29167 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -3.29596 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -3.25996 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -3.24725 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -3.21627 over 12800 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -3.19568 over 12800 frames.
LOG (nnet-shuffle-egs:main():nnet-shuffle-egs.cc:102) Shuffled order of 393060 neural-network training examples using a buffer (partial randomization)
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:187) Training objective function (this phase) is -3.15921 over 9060 frames.
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:202) Did backprop on 393060 examples, average log-prob per frame is -4.18436
LOG (nnet-train-simple:TrainNnetSimple():train-nnet.cc:205) [this line is to be parsed by a script:] log-prob-per-frame=-4.18436
LOG (nnet-train-simple:PrintProfile():cu-device.cc:415) -----
[cudevice profile]
CuVectorBase::ApplyFloor    0.235996s
GroupPnorm    0.243694s
AddMatVec    0.252531s
CuVector::Resize    0.258291s
Sum    0.2678s
MulColsVec    0.270175s
AddDiagVecMat    0.316658s
GroupPnormDeriv    0.36152s
ApplySoftMaxPerRow    0.363778s
SymAddMat2    0.734227s
AddDiagMatMat    0.738146s
CuMatrixBase::CopyFromMat(from other CuMatrixBase)    0.808024s
CuMatrix::Resize    1.1938s
CuMatrix::SetZero    1.26555s
AddMatMat    12.1294s
Total GPU time:    21.1021s (may involve some double-counting)
-----
LOG (nnet-train-simple:PrintMemoryUsage():cu-allocator.cc:127) Memory usage: 79775184 bytes currently allocated (max: 81160528); 0 currently in use by user (max: 54112000); 80/106065 calls to Malloc* resulted in CUDA calls.
LOG (nnet-train-simple:PrintMemoryUsage():cu-allocator.cc:134) Time taken in cudaMallocPitch=0.00515485, in cudaMalloc=0.000409603, in cudaFree=0.000518322, in this->MallocPitch()=0.100768
LOG (nnet-train-simple:PrintMemoryUsage():cu-device.cc:388) Memory used (according to the device): 87040000 bytes.
LOG (nnet-train-simple:main():nnet-train-simple.cc:107) Finished training, processed 393060 training examples.  Wrote model to exp/nnet7a_960_gpu/1.1.mdl
# Accounting: time=23 threads=1
# Ended (code 0) at Wed Dec  2 15:52:33 CET 2015, elapsed time 23 seconds



train.0.2.log
# nnet-shuffle-egs --buffer-size=5000 --srand=0 ark:exp/nnet7a_960_gpu/egs/egs.2.0.ark ark:- | nnet-train-simple --minibatch-size=256 --srand=0 exp/nnet7a_960_gpu/0.mdl ark:- exp/nnet7a_960_gpu/1.2.mdl
# Started at Wed Dec  2 15:52:10 CET 2015
#
nnet-train-simple --minibatch-size=256 --srand=0 exp/nnet7a_960_gpu/0.mdl ark:- exp/nnet7a_960_gpu/1.2.mdl
nnet-shuffle-egs --buffer-size=5000 --srand=0 ark:exp/nnet7a_960_gpu/egs/egs.2.0.ark ark:-
WARNING (nnet-train-simple:SelectGpuId():cu-device.cc:129) Will try again to get a GPU after 20 seconds.
KALDI_ASSERT: at nnet-train-simple:SelectGpuId:cu-device.cc:161, failed: cudaSuccess == cudaThreadSynchronize()
Stack trace is:
kaldi::KaldiGetStackTrace()
kaldi::KaldiAssertFailure_(char const*, char const*, int, char const*)
kaldi::CuDevice::SelectGpuId(std::string)
nnet-train-simple(main+0x417) [0x692a0d]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f31c19b1b45]
nnet-train-simple() [0x692529]
KALDI_ASSERT: at nnet-train-simple:SelectGpuId:cu-device.cc:161, failed: cudaSuccess == cudaThreadSynchronize()
Stack trace is:
kaldi::KaldiGetStackTrace()
kaldi::KaldiAssertFailure_(char const*, char const*, int, char const*)
kaldi::CuDevice::SelectGpuId(std::string)
nnet-train-simple(main+0x417) [0x692a0d]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f31c19b1b45]
nnet-train-simple() [0x692529]

# Accounting: time=20 threads=1
# Ended (code 255) at Wed Dec  2 15:52:30 CET 2015, elapsed time 20 seconds

Jan Trmal

Dec 2, 2015, 11:03:27 AM
to kaldi-help

You probably didn't set the devices to compute-exclusive mode. Does nvidia-smi list both devices, and does it show the compute-exclusive mode?
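
(Something like the following should show it; just a sketch using standard nvidia-smi queries:)

# list both boards together with their compute mode
nvidia-smi --query-gpu=index,name,compute_mode --format=csv
# or the more verbose view
nvidia-smi -q -d COMPUTE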

Btw: how long does the training take on a single GPU?
Y.


Piero Cosi

Dec 2, 2015, 11:16:14 AM
to kaldi...@googlegroups.com
Actually ... NO!

Indeed, nvidia-smi lists both devices and it does show
Compute Mode "Exclusive_Process" for both boards!!!!

CIAOOO
Piero



Piero Cosi

Dec 2, 2015, 12:32:27 PM
to kaldi...@googlegroups.com
By the way ... I tried with

nvidia-smi -c 1 (exclusive_thread ... which is deprecated in DEBIAN JESSIE)

and

nvidia-smi -c 3 (exclusive_process)

on both boards ... but I have the same error!
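
(For reference, setting and checking it per board looks roughly like this -- a sketch, and it needs root; note that the compute mode resets to DEFAULT after a reboot unless it is re-applied:)

sudo nvidia-smi -i 0 -c EXCLUSIVE_PROCESS   # same as -c 3, for board 0
sudo nvidia-smi -i 1 -c EXCLUSIVE_PROCESS   # and for board 1
nvidia-smi -q -d COMPUTE | grep -i "compute mode"   # verify the setting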

Daniel Povey

Dec 2, 2015, 3:44:40 PM
to kaldi-help
I just pushed a fix to the code about this.
There was a bug whereby, if there was a failure getting a device context, it was not printing the error message that it should print and was instead hitting an assert statement.
It's likely that some other process is using the GPU -- maybe a process you were debugging, or a process in another terminal, or your system's graphics.  nvidia-smi lists the processes that are using each GPU and should tell you what process is accessing it.
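
(A quick way to see that, as a sketch using standard nvidia-smi queries:)

# show which processes currently hold a GPU context
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
# or just read the "Processes" table at the bottom of the plain nvidia-smi output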

Dan

Piero Cosi

Dec 3, 2015, 6:10:30 AM
to kaldi-help, dpo...@gmail.com
Dear Dan,

Many Thanks. I did what you suggested!
I pulled and recompiled KALDI,

and before trying again ...
I checked my set-up:

nvidia-smi correctly shows the 2 K40c boards:
0, Tesla K40c
1, Tesla K40c

I have a variable
CUDA_VISIBLE_DEVICES=0,1
set for the 2 K40c boards

and in Librispeech

I now tried "run_5a_clean_100.sh"
(the 100-hour subset) ... to speed things up!

instead of the full
data set ("run_7a_960.sh").

As in your original "run_5a_clean_100.sh",

I set
num_jobs_nnet=4

and having 2 GPUs I set:
parallel_opts="--gpu 2"  ... I hope this is correct! ... instead of "--gpu 1"
num_threads=1
minibatch_size=512
dir=exp/nnet5a_clean_100_gpu

The training script is now correctly running with 4 nnet_jobs ... but ONLY on 1 GPU
(The active GPU is [0] ... always the same!)

I checked the logs (attached) and they seem correct! ... apart from the fact that they always
say ... "Selecting from 1 GPUs"!!! but I have 2!

Probably I am missing something in my environment settings and I am doing something wrong!
... I do not know how to get the 2nd GPU used as well.

ANY HINTS!?!?!?



by the way in your original
"run_5a_clean_100.sh"      there is         parallel_opts="--gpu 1"
"run_6a_clean_460.sh"      there is         parallel_opts="--gpu 1"
"run_7a_960.sh"                there is         parallel_opts="-l gpu=1"

are they all correct!?
train.5.1.log
train.5.2.log
train.5.3.log
train.5.4.log

Piero Cosi

Dec 3, 2015, 12:00:19 PM
to kaldi-help, dpo...@gmail.com
SORRY ... I just discovered with nvidia-smi that Compute Mode
on both GPUs was set to DEFAULT!

and NOW ... I am quite confused!!!!

The training was running on a single GPU but it was running
with 4 jobs (num_jobs_nnet=4)
apparently with no errors!

Was it wrong?!? ... there were no apparent errors ... so ...
why should I set Compute Mode to Exclusive_Process?!?

By the way,
if it is possible to run training while Compute Mode is set to Default,
then with 12GB of memory in each TESLA K40c one could
increase the number of jobs quite a bit?!

BUT PROBABLY I am not understanding a lot of things!!!

MANY THANKS for your HELP!


Jan Trmal

Dec 3, 2015, 12:01:24 PM
to kaldi-help, Dan Povey
Piero, from what you've sent I don't think you have set the cards to compute-exclusive mode -- it should say "E. Thread" in the nvidia-smi output (not sure what is going on, as you've said you used nvidia-smi -c 1 to change it).

Also, the parameter "parallel_opts" should be set to
parallel_opts="--gpu 1"
(its semantics is "number of GPUs for an individual SGE task", AFAIK).
Do not use the older switch "-l gpu=1" -- that was the old way of specifying GPU use, which was SGE-dependent.
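
(In other words, something like this in the run script -- a sketch, with the values other than parallel_opts taken from your message:)

num_jobs_nnet=4            # number of parallel nnet training jobs
num_threads=1
minibatch_size=512
parallel_opts="--gpu 1"    # one GPU per nnet job; not the old "-l gpu=1"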

I still think there is something else missing, because from the logs it seems that only the card with id 0 is visible.
You shouldn't need CUDA_VISIBLE_DEVICES -- actually I wonder whether the variable is really set to CUDA_VISIBLE_DEVICES=0,1 when you run the task. If it were set to only CUDA_VISIBLE_DEVICES=0, that could explain why Kaldi sees only one GPU.
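
(A quick check, as a sketch, from the same shell/session that launches the jobs:)

echo "CUDA_VISIBLE_DEVICES='${CUDA_VISIBLE_DEVICES}'"   # should be unset/empty or "0,1"
nvidia-smi -L                                           # should list both K40c boards
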
y.

Jan Trmal

Dec 3, 2015, 12:10:29 PM
to kaldi-help, Dan Povey
Yes, if you have a sufficiently recent card, you can run multiple tasks on the same GPU. We usually discourage that, as there probably isn't any significant speed benefit (running two tasks in parallel takes about the same time as running them sequentially one after the other), it can lead to crashes when the GPU memory is exhausted, and there is another bunch of issues when you do this in a multi-user/multi-task environment.
y



Karel Veselý

Dec 3, 2015, 4:27:37 PM
to kaldi-help, dpo...@gmail.com
Hi Piero,
yes, the reason why everything runs on GPU 0 is the use of the 'Default' computation mode.
Usually we rely on the OS/driver to select a free card, which is why we use the compute-exclusive
mode. In the log there is a warning which says that the configuration should be changed.

A few years ago I did an experiment in which two jobs were using a single GPU.
The run times were more than 2x slower than in the case of a single process per GPU.

Best,
Karel.

Niko

Jun 7, 2016, 8:41:28 AM
to kaldi-help, dpo...@gmail.com
The compute-exclusive mode is unfortunately a bit error-prone, because if someone submits another GPU job on a multi-GPU machine that already has some GPU jobs running, it may crash. If I understood it correctly, the driver selects a free GPU card based on the available free memory. Do you think it is possible (or makes sense) to add a small delay between launching several GPU jobs on one machine, to attain a more uniform utilization of multiple GPUs even if the compute mode isn't set to exclusive?
Best,
Niko

Niko

Jun 7, 2016, 1:35:33 PM
to kaldi-help, dpo...@gmail.com
Ok, I went through the code, in particular cu-device.cc, to figure out what's happening. It seems that each time a Kaldi executable is called with a true GPU flag, the GPU device is selected based on the free GPU memory (if the exclusive compute mode isn't active). Hence, if several jobs ask for the free memory at the same time, they may all want the same GPU. I made a modification to the run.pl script that puts a sleep of 5 sec (would that be sufficient, or too much?) between starting GPU jobs. I will test that change in the next days when I have time and report the result here, unless someone tells me that it is a stupid idea or that there is maybe a better solution. :-)
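
(The idea, in sketch form -- this is not the actual run.pl patch; the job script and the delay value are placeholders:)

# stagger GPU job start-up so each job's free-memory check sees a different "most free" card
for j in 1 2 3 4; do
  ./train_job.sh $j &   # hypothetical per-job wrapper that runs one GPU training job
  sleep 5               # give the previous job time to grab its GPU
done
wait
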
Best,
Niko

Daniel Povey

Jun 7, 2016, 1:56:28 PM
to Niko, kaldi-help
There is a better solution.
Firstly, the stuff about testing memory only happens if you don't use
compute exclusive mode, and that is highly deprecated.
We recommend always using compute exclusive mode.
This will always work as long as you use a queueing system where
people have to reserve GPUs; search for gridengine in
http://kaldi-asr.org/doc/queue.html.
And if you use run.pl, just make sure that you never require more GPUs
than you have on the machine.
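
(For example, with queue.pl set up as on that page, each array task would reserve its GPU roughly like this -- a sketch adapted from the train.0.*.log commands above, with paths and option values purely illustrative:)

queue.pl --gpu 1 JOB=1:4 exp/nnet5a_clean_100_gpu/log/train.0.JOB.log \
  nnet-shuffle-egs --buffer-size=5000 --srand=0 ark:exp/nnet5a_clean_100_gpu/egs/egs.JOB.0.ark ark:- \| \
  nnet-train-simple --minibatch-size=512 --srand=0 exp/nnet5a_clean_100_gpu/0.mdl ark:- exp/nnet5a_clean_100_gpu/1.JOB.mdl
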
Dan

Niko

Jun 13, 2016, 6:41:26 AM
to kaldi-help, morit...@googlemail.com, dpo...@gmail.com
Just to conclude this: my proposed solution didn't work well, so I will also just stick to the exclusive compute mode.

Daniel Povey

Jun 13, 2016, 2:35:46 PM
to Niko, kaldi-help
BTW, the reason we always recommend this is that, even though there
may be instances where more than one job can fit into the GPU memory,
there tends to be an unexpectedly large decrease in speed, such that
it would really be better to run them sequentially.
Dan