chain tdnn training GPU memory


David van Leeuwen

Oct 25, 2017, 4:33:48 AM10/25/17
to kaldi-help
Hello, 

It seems I am running out of memory on the GPU (a poor man's GTX 1080) in `nnet3-chain-train`, but I don't really understand why, given the numbers in the logfile.  There should be about 8 GB available, but `nnet3-chain-train` bails out on an allocation of about 320 MB while only about 2600 MB is currently allocated, if I interpret the logs correctly.  The relevant lines from the log are:

LOG (nnet3-chain-train[5.2.160~2-51042]:IsComputeExclusive():cu-device.cc:263) CUDA setup operating under Compute Exclusive Process Mode.
LOG (nnet3-chain-train[5.2.160~2-51042]:FinalizeActiveGpu():cu-device.cc:225) The active GPU is [0]: GeForce GTX 1080   free:7966M, used:147M, total:8113M, free/total:0.981882 version 6.1
LOG (nnet3-chain-train[5.2.160~2-51042]:PrintMemoryUsage():cu-allocator.cc:127) Memory usage: 4047244608 bytes currently allocated (max: 4543456848); 2321673216 currently in use by user (max: 3029200088); 1721/13448 calls to Malloc* resulted in CUDA calls.
LOG (nnet3-chain-train[5.2.160~2-51042]:PrintMemoryUsage():cu-allocator.cc:134) Time taken in cudaMallocPitch=-1.10451e+17, in cudaMalloc=-1.25219e+16, in cudaFree=8.43331e+11, in this->MallocPitch()=-1.45745e+20
WARNING (nnet3-chain-train[5.2.160~2-51042]:MallocPitchInternal():cu-allocator.cc:97) Allocation of 6968320 x 48 region failed: freeing some memory and trying again. 
LOG (nnet3-chain-train[5.2.160~2-51042]:MallocPitchInternal():cu-allocator.cc:102) To avoid future problems like this, changing memory_factor from 1.5 to 1.1
LOG (nnet3-chain-train[5.2.160~2-51042]:PrintMemoryUsage():cu-allocator.cc:127) Memory usage: 3116352832 bytes currently allocated (max: 4543456848); 2321673216 currently in use by user (max: 3029200088); 1721/13448 calls to Malloc* resulted in CUDA calls.
LOG (nnet3-chain-train[5.2.160~2-51042]:PrintMemoryUsage():cu-allocator.cc:134) Time taken in cudaMallocPitch=-1.10591e+17, in cudaMalloc=-1.25219e+16, in cudaFree=-9.3372e+13, in this->MallocPitch()=-1.45745e+20
WARNING (nnet3-chain-train[5.2.160~2-51042]:MallocPitchInternal():cu-allocator.cc:97) Allocation of 6968320 x 48 region failed: freeing some memory and trying again. 
LOG (nnet3-chain-train[5.2.160~2-51042]:PrintMemoryUsage():cu-allocator.cc:127) Memory usage: 2701251904 bytes currently allocated (max: 4543456848); 2321673216 currently in use by user (max: 3029200088); 1721/13448 calls to Malloc* resulted in CUDA calls.
LOG (nnet3-chain-train[5.2.160~2-51042]:PrintMemoryUsage():cu-allocator.cc:134) Time taken in cudaMallocPitch=-1.10732e+17, in cudaMalloc=-1.25219e+16, in cudaFree=-1.87587e+14, in this->MallocPitch()=-1.45745e+20
ERROR (nnet3-chain-train[5.2.160~2-51042]:MallocPitchInternal():cu-allocator.cc:114) Cannot allocate the requested memory (6968320 x 48 = 334479360 bytes)

Where would I configure the `memory_factor`, as suggested?  

For now, I've reduced the largest number in `--trainer.num-chunk-per-minibatch`, and that seems to help. 
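For reference, the failed request in the log can be checked by hand; the only inputs below are the two numbers from the ERROR line, and the MiB conversion confirms the "about 320 MB" figure:

```python
# Back-of-envelope check of the failed allocation from the log.
rows, cols = 6968320, 48           # "Allocation of 6968320 x 48 region"
request_bytes = rows * cols
request_mib = request_bytes / 2**20

print(request_bytes)               # 334479360, matching the ERROR line
print(round(request_mib))          # 319 (MiB), i.e. "about 320 MB"
```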

Cheers, 

---david

Daniel Povey

Oct 25, 2017, 1:26:18 PM10/25/17
to kaldi-help
Hm.
In the past I haven't been too careful about correlating those numbers
with the amount of memory the GPU claims to have available. I suspect
the discrepancy has to do partly with how "cudaMallocPitch" works (it
may leave largish gaps between rows of matrices), partly due to
overhead from the way CUDA's memory allocation works, and maybe partly
due to fragmentation.
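For what it's worth, the "gaps" point can be illustrated with a toy model: cudaMallocPitch rounds each row up to an alignment boundary (the 512 below is purely illustrative; the real pitch is chosen by the driver and is device- and size-dependent), so narrow matrices can carry a lot of padding:

```python
def pitched_alloc(width_bytes, height, alignment=512):
    """Toy model of cudaMallocPitch: round each row up to `alignment`.
    The 512-byte alignment is illustrative only."""
    pitch = -(-width_bytes // alignment) * alignment  # ceil to alignment
    return pitch * height, pitch

total, pitch = pitched_alloc(width_bytes=48, height=1000)
print(pitch)   # 512: a 48-byte row occupies a full 512-byte pitch
print(total)   # 512000 bytes allocated for 48000 bytes of data
```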

I think the easiest fix for you is just to change the
num-chunk-per-minibatch, like you have done. You don't have to mess
with the memory_factor.

Something that bothers me a bit is that it's printing negative numbers
for the time taken. I can't see in the code how that would be
possible. I suspect it's a problem that could be resolved by
doing "make depend" and then "make". I'd like to know if that fixes
it.

At some point someone may want to look into this more deeply and see
which of those 3 factors (gaps/overhead/fragmentation) is responsible
for the discrepancy. The "gaps" choice would be easy to get a handle
on by simply adding more diagnostics in cu-allocator.{h,cc}. If you
wanted to do that, it would be great.

Dan

David van Leeuwen

Oct 25, 2017, 4:08:04 PM10/25/17
to kaldi-help
Hi, 

Thanks for the explanations.  After the current training has finished, I'll verify the `make depend; make` and try to reproduce the problem.  

Looking at the timestamps on the filesystem, the .o files from the build are all just a few minutes newer than the .depend.mk files, except for a few .depend.mk files in directories like `src/sgmmbin`.  I don't know the details of the build process, so this may simply be because nothing changed there since my last build.  In any case, I believe I did run `make depend; make`. 

---david

David van Leeuwen

Oct 27, 2017, 9:53:05 AM10/27/17
to kaldi-help
Hi, 

OK,

[ Restarting the training was a bit painful because all the egs had to be recreated; I should have passed `--remove-egs=false` originally. 
I did a `git pull` and a `make depend; make`, and ran into the same memory problem at the same location, with the same negative time intervals.
I re-installed the latest nvidia driver, 384.90: same problems.
Then I did a `make clean; make depend; make` for the kaldi binaries just to be sure.  Same problems. ]

I've had a look at the code; it appears that the CuTimer class only gives sensible results if the verbose level is >= 1.  To check this, I had to hack it into `acoustic_model.py` in a rather ugly way; with that change I get normal numbers for the timing:

LOG (nnet3-chain-train[5.2.179~2-41301]:PrintMemoryUsage():cu-allocator.cc:127) Memory usage: 4115613936 bytes currently allocated (max: 4543694392); 2631947264 currently in use by user (max: 3029200088); 1504/12350 calls to Malloc* resulted in CUDA calls.
LOG (nnet3-chain-train[5.2.179~2-41301]:PrintMemoryUsage():cu-allocator.cc:134) Time taken in cudaMallocPitch=0.583654, in cudaMalloc=0.00994802, in cudaFree=0.213659, in this->MallocPitch()=0.875278
WARNING (nnet3-chain-train[5.2.179~2-41301]:MallocPitchInternal():cu-allocator.cc:97) Allocation of 6968320 x 55 region failed: freeing some memory and trying again.
LOG (nnet3-chain-train[5.2.179~2-41301]:MallocPitchInternal():cu-allocator.cc:102) To avoid future problems like this, changing memory_factor from 1.5 to 1.1
LOG (nnet3-chain-train[5.2.179~2-41301]:PrintMemoryUsage():cu-allocator.cc:127) Memory usage: 3368208344 bytes currently allocated (max: 4543694392); 2631947264 currently in use by user (max: 3029200088); 1504/12350 calls to Malloc* resulted in CUDA calls.
LOG (nnet3-chain-train[5.2.179~2-41301]:PrintMemoryUsage():cu-allocator.cc:134) Time taken in cudaMallocPitch=0.584576, in cudaMalloc=0.00994802, in cudaFree=0.213893, in this->MallocPitch()=0.875278
WARNING (nnet3-chain-train[5.2.179~2-41301]:MallocPitchInternal():cu-allocator.cc:97) Allocation of 6968320 x 55 region failed: freeing some memory and trying again.
LOG (nnet3-chain-train[5.2.179~2-41301]:PrintMemoryUsage():cu-allocator.cc:127) Memory usage: 2967130800 bytes currently allocated (max: 4543694392); 2631947264 currently in use by user (max: 3029200088); 1504/12350 calls to Malloc* resulted in CUDA calls.
LOG (nnet3-chain-train[5.2.179~2-41301]:PrintMemoryUsage():cu-allocator.cc:134) Time taken in cudaMallocPitch=0.585493, in cudaMalloc=0.00994802, in cudaFree=0.214093, in this->MallocPitch()=0.875278
ERROR (nnet3-chain-train[5.2.179~2-41301]:MallocPitchInternal():cu-allocator.cc:114) Cannot allocate the requested memory (6968320 x 55 = 383257600 bytes)
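In Python, the pattern looks roughly like this (the names and structure are my own sketch, not the actual Kaldi code; the point is that the start time is only recorded when verbose, so the elapsed-time totals are garbage otherwise):

```python
import time

VERBOSE_LEVEL = 0  # stand-in for Kaldi's verbose level

class CuTimerSketch:
    """Sketch of a verbosity-gated timer. When the verbose level is < 1
    the start time is never recorded, so elapsed() returns garbage --
    which would explain the negative totals in the log."""
    def __init__(self):
        # Only pay for the clock read when actually profiling; otherwise
        # leave a bogus sentinel, akin to an uninitialized field in C++.
        self.start = time.monotonic() if VERBOSE_LEVEL >= 1 else float("inf")

    def elapsed(self):
        return time.monotonic() - self.start

t = CuTimerSketch()
print(t.elapsed() < 0)  # True when VERBOSE_LEVEL < 1
```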

---david

Daniel Povey

Oct 27, 2017, 10:39:00 AM10/27/17
to kaldi-help
Oh OK, I realize now that the CuTimer thing is by design, but we should change that code so it only prints the timing if it's been accumulated.
At this point there is no mystery.  You should reduce the num-chunk-per-minibatch.



