Dear Dan!
The issue I found is that after I reinstalled Kaldi, the GPU memory usage of the DNN training processes increased dramatically.
I've done some investigation to describe (and to solve or work around) the problem.
Normally my DNN training process uses 200-300 MB of GPU memory, so nvidia-smi (piped through grep nnet-train-simple) shows something like this:
| 0 6737 C nnet-train-simple 249MiB |
I've found that 8c0e3e3119a45d3cb8b4b60a62acecabf9154d91 ("Refactor CUDA allocator code based on large cached regions") is the commit where it turns into:
| 0 6980 C nnet-train-simple 4129MiB |
I tried the latest NVIDIA driver and CUDA versions (410.48 / 10.0) as well as an older combination (384.145 / 9.0), but it made no difference. The GPU is a GTX 1070.
I logged the growth of the allocated memory by running nvidia-smi every 0.25 seconds. The logging started just before nnet-train-simple is launched by train_pnorm_fast.sh.
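For reference, the polling loop was essentially the following (the log file name is just an example):

while true; do
    # keep only the training process line and append it to the log
    nvidia-smi | grep nnet-train-simple >> gpu_mem.log
    sleep 0.25
done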
This is what I got in the first (good) case:
steps/nnet2/train_pnorm_fast.sh --num-threads 1 --num-jobs-nnet 1 --num-hidden-layers 6 --p 4 --pnorm-input-dim 2000 --pnorm-output-dim 400 --stage 0 --cmd
run.pl --cleanup false --minibatch-size 512 --initial-learning-rate 0.1 --final-learning-rate 0.01 data/train data/lang exp/tri3_ali exp/pnorm6x2000
steps/nnet2/train_pnorm_fast.sh: Will train for 15 + 5 epochs, equalling
steps/nnet2/train_pnorm_fast.sh: 1350 + 450 = 1800 iterations,
steps/nnet2/train_pnorm_fast.sh: (while reducing learning rate) + (with constant learning rate).
steps/nnet2/train_pnorm_fast.sh: Will not do mix up
Training neural net (pass 0)
| 0 6737 C nnet-train-simple 8MiB |
| 0 6737 C nnet-train-simple 8MiB |
| 0 6737 C nnet-train-simple 123MiB |
| 0 6737 C nnet-train-simple 147MiB |
| 0 6737 C nnet-train-simple 241MiB |
| 0 6737 C nnet-train-simple 249MiB |
| 0 6737 C nnet-train-simple 249MiB |
| 0 6737 C nnet-train-simple 249MiB |
| 0 6737 C nnet-train-simple 249MiB |
| 0 6737 C nnet-train-simple 249MiB |
And with the more recent Kaldi version, at some point the memory usage jumps to ~4 GB:
steps/nnet2/train_pnorm_fast.sh --num-threads 1 --num-jobs-nnet 1 --num-hidden-layers 6 --p 4 --pnorm-input-dim 2000 --pnorm-output-dim 400 --stage 0 --cmd
run.pl --cleanup false --minibatch-size 512 --initial-learning-rate 0.1 --final-learning-rate 0.01 data/train data/lang exp/tri3_ali exp/pnorm6x2000
steps/nnet2/train_pnorm_fast.sh: Will train for 15 + 5 epochs, equalling
steps/nnet2/train_pnorm_fast.sh: 1350 + 450 = 1800 iterations,
steps/nnet2/train_pnorm_fast.sh: (while reducing learning rate) + (with constant learning rate).
steps/nnet2/train_pnorm_fast.sh: Will not do mix up
Training neural net (pass 0)
| 0 6980 C nnet-train-simple 8MiB |
| 0 6980 C nnet-train-simple 13MiB |
| 0 6980 C nnet-train-simple 125MiB |
| 0 6980 C nnet-train-simple 149MiB |
| 0 6980 C nnet-train-simple 4129MiB |
| 0 6980 C nnet-train-simple 4129MiB |
| 0 6980 C nnet-train-simple 4129MiB |
| 0 6980 C nnet-train-simple 4129MiB |
| 0 6980 C nnet-train-simple 4129MiB |
| 0 6980 C nnet-train-simple 4129MiB |
| 0 6980 C nnet-train-simple 4129MiB |
I usually run 3 jobs per GPU (e.g. --num-jobs-nnet 18 on 6 GPUs) to minimize the overall run time, but with this memory usage only one process fits in GPU RAM. So for now I'm using the version from the previous commit.
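In case it helps to reproduce, this is roughly how I went back to the commit just before the allocator refactor (the -j value is just what I use locally):

git checkout 8c0e3e3119a45d3cb8b4b60a62acecabf9154d91~1
cd src
make depend -j 8
make -j 8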
Is this a bug, or should I configure something differently for the modified CUDA allocator?
Thanks,
András