Increased GPU memory usage of nnet-train-simple

András Balog

Oct 3, 2018, 8:54:50 AM
to kaldi-help
Dear Dan!

The issue I found is that after I reinstalled Kaldi, the GPU memory usage of the DNN training processes increased dramatically.
I've done some investigation to describe (and, if possible, solve or avoid) the problem.
Normally my DNN training process uses 200-300 MB of GPU memory, so nvidia-smi ( | grep nnet-train-simple) shows something like this:
|    0      6737      C   nnet-train-simple                            249MiB |

I've found that 8c0e3e3119a45d3cb8b4b60a62acecabf9154d91 ("Refactor CUDA allocator code based on large cached regions") is the commit after which it turns into:
|    0      6980      C   nnet-train-simple                           4129MiB |

I tried the latest NVIDIA driver and CUDA (410.48 / 10.0) as well as older versions (384.145 / 9.0), but it made no difference. The GPU is a GTX 1070.

I logged how the allocated memory grows by running nvidia-smi every 0.25 seconds. The logging started just before nnet-train-simple was launched from train_pnorm_fast.sh.
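(Roughly, the logging loop was something like the following; this is a sketch rather than the exact script I used, and the log file name is just an example:)

while true; do
  nvidia-smi | grep nnet-train-simple >> gpu_mem.log
  sleep 0.25
done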
This is what I got in the first (good) case:

steps/nnet2/train_pnorm_fast.sh --num-threads 1 --num-jobs-nnet 1 --num-hidden-layers 6 --p 4 --pnorm-input-dim 2000 --pnorm-output-dim 400 --stage 0 --cmd run.pl --cleanup false --minibatch-size 512 --initial-learning-rate 0.1 --final-learning-rate 0.01 data/train data/lang exp/tri3_ali exp/pnorm6x2000
steps/nnet2/train_pnorm_fast.sh: Will train for 15 + 5 epochs, equalling 
steps/nnet2/train_pnorm_fast.sh: 1350 + 450 = 1800 iterations, 
steps/nnet2/train_pnorm_fast.sh: (while reducing learning rate) + (with constant learning rate).
steps/nnet2/train_pnorm_fast.sh: Will not do mix up
Training neural net (pass 0)
|    0      6737      C   nnet-train-simple                              8MiB |
|    0      6737      C   nnet-train-simple                              8MiB |
|    0      6737      C   nnet-train-simple                            123MiB |
|    0      6737      C   nnet-train-simple                            147MiB |
|    0      6737      C   nnet-train-simple                            241MiB |
|    0      6737      C   nnet-train-simple                            249MiB |
|    0      6737      C   nnet-train-simple                            249MiB |
|    0      6737      C   nnet-train-simple                            249MiB |
|    0      6737      C   nnet-train-simple                            249MiB |
|    0      6737      C   nnet-train-simple                            249MiB |

And with the more recent Kaldi version, the memory usage jumps to ~4 GB at some point:

steps/nnet2/train_pnorm_fast.sh --num-threads 1 --num-jobs-nnet 1 --num-hidden-layers 6 --p 4 --pnorm-input-dim 2000 --pnorm-output-dim 400 --stage 0 --cmd run.pl --cleanup false --minibatch-size 512 --initial-learning-rate 0.1 --final-learning-rate 0.01 data/train data/lang exp/tri3_ali exp/pnorm6x2000
steps/nnet2/train_pnorm_fast.sh: Will train for 15 + 5 epochs, equalling 
steps/nnet2/train_pnorm_fast.sh: 1350 + 450 = 1800 iterations, 
steps/nnet2/train_pnorm_fast.sh: (while reducing learning rate) + (with constant learning rate).
steps/nnet2/train_pnorm_fast.sh: Will not do mix up
Training neural net (pass 0)
|    0      6980      C   nnet-train-simple                              8MiB |
|    0      6980      C   nnet-train-simple                             13MiB |
|    0      6980      C   nnet-train-simple                            125MiB |
|    0      6980      C   nnet-train-simple                            149MiB |
|    0      6980      C   nnet-train-simple                           4129MiB |
|    0      6980      C   nnet-train-simple                           4129MiB |
|    0      6980      C   nnet-train-simple                           4129MiB |
|    0      6980      C   nnet-train-simple                           4129MiB |
|    0      6980      C   nnet-train-simple                           4129MiB |
|    0      6980      C   nnet-train-simple                           4129MiB |
|    0      6980      C   nnet-train-simple                           4129MiB |

I usually run 3 jobs per GPU (e.g. --num-jobs-nnet 18 on 6 GPUs) to minimize the overall run time, but with the new allocator only one process fits in the GPU's RAM. So for now I am using the version at the previous commit.

Is this a bug, or should I configure something differently for the modified CUDA allocation?

Thanks,

András

Daniel Povey

Oct 3, 2018, 11:44:29 AM
to kaldi-help

It's recommended not to share the GPU between multiple training processes like that.  Possibly you could give them the option --use-gpu=wait (I don't know what the script-level options for this are, as I haven't touched nnet2 in a long time).
However, to work around the problem you can change CuAllocatorOptions::memory_proportion from, say, 0.8 to 0.1.
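Roughly, that workaround looks like this (a sketch only; the field name and default come from this thread, so check your own checkout before editing):

# Sketch of the workaround, assuming a standard Kaldi source checkout.
# 1. Find the default of CuAllocatorOptions::memory_proportion in the
#    allocator header and lower it by hand (e.g. to 0.1, so each job
#    reserves only ~10% of the GPU's memory):
grep -n "memory_proportion" src/cudamatrix/cu-allocator.h
# 2. Rebuild so the binaries pick up the changed default:
cd src && make -j 4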


András Balog

Oct 3, 2018, 1:40:29 PM
to kaldi-help
Thank you! So this is indeed a feature, not a bug. I'll do some performance tests (my old 3-job setup with the old allocator, the old allocator with 1 job, and the new allocator with 1 or more jobs) and see which works faster, or which doesn't work at all.

Note: in cu-allocator.h the comment says that the default value is 0.8, but as far as I can see the code actually sets it to 0.5. That would also explain the ~4 GB I see: roughly half of the GTX 1070's 8 GB.

András